Abstract 

We discuss a formulation for active example selection for function learning problems. 
This formulation is obtained by adapting Fedorov's optimal experiment design to the 
learning problem. We specifically show how to analytically derive example selection 
algorithms for certain well defined function classes. We then explore the behavior and 
sample complexity of such active learning algorithms. Finally, we view object detection 
as a special case of function learning and show how our formulation reduces to a useful 
heuristic to choose examples to reduce the generalization error. 
1 Introduction 

Many problems in disparate fields such as computer vision, finance, and natural language processing are 
increasingly being approached from a machine learning perspective. Typically, there is some task 
to be performed, such as object recognition/detection or stock market prediction, and the learner has 
access to data relevant to this task. On the basis of this data set, the learner develops hypotheses 
and uses these to perform the task. 

In most classical formulations of learning from examples, the data (examples) are assumed to 
be randomly drawn and presented to the learner. This is the case in a variety of settings, 
ranging from network models [16, 14] and PAC frameworks [23] to classical pattern recognition. In 
this sense, the learner is a passive recipient of information about the target concept. 

In contrast, one could consider a learner that plays a more active role in collecting its exam- 
ples. In this paper, we take a formal look at the problem of selecting high utility examples for 
machine learning systems. The example selection problem falls under a newly emerging general 
area of research, called active learning, that investigates how learners can pose intelligent queries 
to teachers under various learning scenarios, to achieve "better" learning results. Active learn- 
ing differs from traditional example-based learning paradigms in the following way: Rather than 
passively accepting training examples that randomly describe a target concept, an active learner 
uses information derived from its current state and prior knowledge about the target concept to 
intelligently gather useful examples from specific input space locations for further training. By care- 
fully generating intelligent queries instead of performing random sampling, one can expect active 
learning techniques to have faster learning rates and better approximation results than traditional 
example-based learning algorithms. 

Our main focus is on active example selection strategies for a function approximation based 
learning framework. Specifically, we address the following three questions: 

1. Given a function approximation based learning task and some prior information about the 
target function, are there principled strategies for selecting useful training data in some 
"optimal" fashion? 

2. Assuming such principled data selection strategies do exist, do these active strategies require 
fewer examples than classical learning techniques to approximate target functions to the same 
degree of accuracy? 

3. Can one directly apply these active example selection strategies to real-world function ap- 
proximation learning tasks or easily adapt them into more feasible forms without losing too 
much of their original flavor? 

Using ideas from Optimal Experiment Design [6], we begin by proposing an active example 
selection formulation for function approximation tasks and show that one can indeed select high 
utility examples for a given task in a principled and "optimal" fashion. MacKay [10] and Cohn 
[5] have adopted a similar formalism for active learning, and we comment on differences with their 
work at appropriate points later in the paper. More recently, Sollich [19] has arrived at a very similar 
formulation motivated by statistical mechanics. 

While the proposed formulation (and the variants suggested by others) is certainly well motivated, 
it is, however, computationally intractable in general. In this paper, we show that the 
general formulation can be used to analytically derive precise, tractable, data selection algorithms 
for three specific function approximation classes: (1) unit step functions, (2) polynomial approxi- 
mators and (3) Gaussian radial basis function networks. In addition, we describe some conditions 
on parameterized function classes for which the general formulation can be analytically solved to 
yield active algorithms. For the three function classes considered in this paper, we provide either 
theoretical or empirical results suggesting that the active strategy learns the target function with 
fewer data examples than random sampling. 

Ultimately, the litmus test of any theoretical framework is whether it can affect the way practical 
learning systems are built. To this effect, we consider a reduced version of the original active learning 
formulation that essentially hunts for new data where approximation "error bars" are high. We 
show how such a scheme, with minor modifications, leads to a practical example selection strategy 
(referred to henceforth as a "boot-strap" strategy). We have adopted this method of choosing 
examples in an object and pattern class detection approach. Although the "boot-strap" strategy 
loses some of the original active learning flavor and may thus be "sub-optimal" in its choice of new 
examples, we show empirically that it still outperforms random sampling in training a frontal face 
detection system, and is therefore still an effective means of dealing with unmanageably large data 
sets to make learning tasks tractable. 

2 Background and Approach 

We start with an overview of active learning and related work. Active learning has appeared 
in various forms throughout the knowledge engineering and machine learning literature. One early 
implementation can be found in certain expert systems, where an important component of learning 
relies on issuing queries to the instructor. For example, Sammut and Banerji [17] use queries 
about specific examples as part of a strategy for efficiently learning a target concept. Shapiro's 
Algorithmic Debugging System prompts the user with a variety of query types to locate errors in 
Prolog programs [18]. In computational learning theory, several types of active learning queries have 
also been defined (see for example [3]) and compared with Valiant's probably approximately correct 
(PAC) model of concept identification under random sampling [24]. Angluin [2], for example, has 
shown that there are concept classes that can be efficiently learnt with membership and equivalence 
queries, but not with random sampling in Valiant's PAC model. 

Some early connectionist approaches toward active learning include: Ahmad and Omohundro 
[1] on training networks by selective attention; Hwang et al. [9] on a query-based neural network 
learning scheme that generates queries near classification boundaries; and Plutowski and White [13] 
on an efficient feedforward network training technique that selects new training examples with 
maximum potential utility from among available candidate examples. Both Hwang et al. [9] 
and Plutowski and White [13] choose new training examples according to information derived from a 
partially trained network. 

Plutowski and White [13] examine the learning task from a more general function approximation 
standpoint, viz., approximating a target function, g(x), using a network output function, 
F(w,x), parameterized by weights w. They design their criteria for selecting new examples to 
meet two objectives: (1) to maximize the accuracy of fit between network output, F(w,x), and 
the target function, g(x), and (2) to minimize the approximation's unreliability in the presence of 
noise. The paper quantifies the above two considerations by proposing an Integrated Mean Squared 
Error (IMSE) measure to be minimized: 



$$\mathrm{IMSE}(x^n) = \int\!\!\int \left[ g(x) - F(l_n(x^n, y^n), x) \right]^2 \gamma^n(dy^n \mid x^n)\, \mu(dx) = \int E\left[ (g(x) - F(w_n, x))^2 \mid x^n \right] \mu(dx),$$



where $x^n$, $y^n$ are the current $n$ pairs of input-output training examples, $l_n(x^n, y^n) = w_n$ is the learning 
rule that begets network weights $w_n$ from the training examples, $\gamma^n(y^n \mid x^n)$ is the conditional 
probability of output distribution $y^n$ given input distribution $x^n$, $E[\cdot \mid x^n]$ is the conditional expected 
network output mean squared error given input distribution $x^n$, and $\mu$ is the probability distribution 
over the input space. To select the next training example, the learning algorithm samples 
at the next input location $x_{n+1}$ that maximally decreases the IMSE. Unfortunately, an obvious 
problem with the approach is that both the IMSE and the analytic expression for its decrement 
(not shown) assume a known target function $g(x)$. This is seldom a reasonable assumption in real 
learning scenarios where the target function is unknown. 

2.1 Regularization Theory and Function Approximation — A Review 

Our main focus in this paper is on function approximation based active learning. We briefly 
review regularization theory as a lead in to our active learning formulation. 

Let $\mathcal{D}_n = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} \mid i = 1, \ldots, n\}$ be a set of $n$ data points obtained by sampling a 
function $g$, possibly in the presence of noise, where $d$ is the input dimensionality. The function 
approximation task is to recover $g$, or at least obtain a reasonable estimate of it, by means of an 
approximator $\hat{g}$. Clearly, the problem is ill-posed [8] because there can be an infinite number of 
functions that pass through those data points. Some constraints are thus needed to transform the 
problem into a well-posed one. The regularization approach [21] [22] [12] [4] selects a function $\hat{g}$ 
that minimizes the following functional: 

$$H[\hat{g}] = \sum_{i=1}^{n} (y_i - \hat{g}(x_i))^2 + \lambda \|P\hat{g}\|^2. \qquad (1)$$

The first term of Equation 1 penalizes discrepancies between the solution, $\hat{g}$, and the observed 
data. The second term, usually called a stabilizer, embodies a priori knowledge about the smoothness 
of the solution. $P$ is a constraint operator, usually a linear differential operator, and $\|\cdot\|$ 
stands for a norm on the function space containing $\hat{g}$, usually the $L_2$ norm. Together, they favor 
functions that do not vary too quickly on $\mathbb{R}^d$. The regularization parameter, $\lambda$, determines the 
trade-off between the two terms: data reliability and prior beliefs. Poggio and Girosi have shown 
that the solution to Equation 1 has the following simple form: 

$$\hat{g}(x) = \sum_{i=1}^{n} c_i\, G(x; x_i) + p(x), \qquad (2)$$

where $G$, $p$ and the coefficients $c_i$ can all be derived from the constraint operator $P$, the $n$ data 
points $(x_i, y_i)$, the stabilizer and some boundary conditions (see [14] for details). 
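
As a minimal sketch of Equation 2, suppose the stabilizer induces a Gaussian basis function $G$ and the term $p(x)$ vanishes. Under those assumptions, which are ours and not spelled out in the text, the coefficients solve the linear system $(G + \lambda I)c = y$. The helper names `gaussian_kernel`, `fit_regularized_rbf` and `predict_rbf`, the width `sigma` and the value of `lam` are illustrative choices only.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Matrix of G(a_i; b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def fit_regularized_rbf(x, y, lam, sigma):
    """Solve (G + lam * I) c = y for the coefficients c_i of Equation 2.

    Assumes a Gaussian G and p(x) = 0 (illustrative simplification).
    """
    G = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(G + lam * np.eye(len(x)), y)

def predict_rbf(x_new, x, c, sigma):
    """Evaluate g_hat(x_new) = sum_i c_i G(x_new; x_i)."""
    return gaussian_kernel(x_new, x, sigma) @ c

# Eight samples of a hypothetical target, sin(2 pi x), on [0, 1].
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2.0 * np.pi * x_train)
coef = fit_regularized_rbf(x_train, y_train, lam=1e-3, sigma=0.2)
fit_at_data = predict_rbf(x_train, x_train, coef, sigma=0.2)
```

In this sketch the residual at the data points equals `lam * coef`, so a larger regularization parameter trades data fidelity for smoothness, as described above.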

For our purpose, it is convenient to adopt a probabilistic interpretation of regularization that 
treats the function $g$ and the data set $\mathcal{D}_n$ as random, dependent variables (see [15]). Using Bayes' 
rule, we can express the conditional probability of the function $g$ given examples $\mathcal{D}_n$, $\mathcal{P}(g \mid \mathcal{D}_n)$, in 
terms of the prior probability of $g$, $\mathcal{P}(g)$, and the conditional probability of $\mathcal{D}_n$ given $g$, $\mathcal{P}(\mathcal{D}_n \mid g)$: 

$$\mathcal{P}(g \mid \mathcal{D}_n) \propto \mathcal{P}(\mathcal{D}_n \mid g)\, \mathcal{P}(g). \qquad (3)$$



Equation 3 relates to the regularization functional of Equation 1 as follows: Suppose the noise at 
each of the $n$ data points is independent and identically Gaussian distributed with variance $\sigma^2$. The 
conditional probability, $\mathcal{P}(\mathcal{D}_n \mid g)$, can be written as: 

$$\mathcal{P}(\mathcal{D}_n \mid g) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - g(x_i))^2 \right).$$

Similarly, if $g$ is a stochastic process [11] [7], we can write $\mathcal{P}(g)$ as: 

$$\mathcal{P}(g) \propto \exp\left( -f \|Pg\|^2 \right),$$

where $f$ is some fixed constant, and $P$ and $\|\cdot\|$ are as defined earlier. Equation 3 thus becomes: 

$$\mathcal{P}(g \mid \mathcal{D}_n) = K \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - g(x_i))^2 \right) \exp\left( -f \|Pg\|^2 \right),$$

where $K$ is some fixed constant. Taking natural logarithms on both sides and performing some 
additional algebra yields: 

$$-2\sigma^2 \ln \mathcal{P}(g \mid \mathcal{D}_n) + 2\sigma^2 \ln K = \sum_{i=1}^{n} (y_i - g(x_i))^2 + 2\sigma^2 f \|Pg\|^2,$$

which is identically Equation 1 with $\lambda = 2\sigma^2 f$ and $H[g] = -2\sigma^2 \ln \mathcal{P}(g \mid \mathcal{D}_n) + 2\sigma^2 \ln K$. So, by choosing 
a function $\hat{g}$ that minimizes $H$, regularization essentially maximizes the conditional probability 
$\mathcal{P}(g \mid \mathcal{D}_n)$. In other words, it chooses: 



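$$\hat{g} \in \arg\min_{f} H[f] = \arg\max_{f} \mathcal{P}(f \mid \mathcal{D}_n) = \arg\max_{f} \mathcal{P}(\mathcal{D}_n \mid f)\, \mathcal{P}(f),$$

that is, an a-posteriori most probable function $\hat{g}$ given the set of examples $\mathcal{D}_n$. 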
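
The equivalence between minimizing the regularization functional and maximizing the posterior can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes a class of straight lines, replaces the stabilizer term by the squared norm of the coefficients, and uses made-up values for the noise variance and the prior constant. None of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.05                  # assumed noise variance sigma^2
f_const = 2.0                  # assumed prior constant f
lam = 2.0 * sigma2 * f_const   # lambda = 2 sigma^2 f, as in the derivation

# Hypothetical noisy samples of a straight line.
x = np.linspace(-1.0, 1.0, 10)
y = 0.5 * x - 0.3 + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)

# Candidate functions f(x) = c0 + c1 * x with random coefficients.
candidates = rng.uniform(-1.0, 1.0, size=(200, 2))

def stabilizer(c):
    # Illustrative stand-in for ||P f||^2: squared norm of the coefficients.
    return float(np.sum(c ** 2))

def data_misfit(c):
    # Sum of squared residuals between the data and the candidate line.
    return float(np.sum((y - (c[0] + c[1] * x)) ** 2))

# Regularization functional H[f] for every candidate.
H = np.array([data_misfit(c) + lam * stabilizer(c) for c in candidates])

# Unnormalised log posterior: log likelihood plus log prior.
log_post = np.array([-data_misfit(c) / (2.0 * sigma2) - f_const * stabilizer(c)
                     for c in candidates])

best_by_H = int(np.argmin(H))
best_by_posterior = int(np.argmax(log_post))
```

Because `H` equals `-2 * sigma2 * log_post` term by term, the candidate that minimizes the functional is the same one that maximizes the posterior.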

2.2 A Bayesian Framework 

The active learning problem for function approximation can be posed as follows: Let $\mathcal{D}_n = 
\{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} \mid i = 1, \ldots, n\}$ be a set of $n$ data points sampled from an unknown target function 
$g$, possibly in the presence of noise, where $d$ is the input dimensionality. Given an approximation 
function concept class, $\mathcal{F}$, where each $f \in \mathcal{F}$ has prior probability $\mathcal{P}_{\mathcal{F}}(f)$, one can use regularization 
techniques to approximate $g$ from $\mathcal{D}_n$ (in the Bayes optimal sense) by means of a function $\hat{g} \in \mathcal{F}$. 
We want a strategy to determine at what input location one should sample the next data point, 
$(x_{n+1}, y_{n+1})$, in order to obtain the "best" possible Bayes optimal approximation of the unknown 
target function $g$ with our concept class $\mathcal{F}$. 

One can use ideas from optimal experiment design [6] to approach the active data sampling 
problem in two stages: 

1. Define what we mean by the "best" possible Bayes optimal approximation of 
an unknown target function. We propose an optimality criterion for evaluating the 
"goodness" of a solution with respect to an unknown target function, similar in spirit to the 
cost function, Equation 1, for a known target. 



2. Formalize mathematically the task of determining where in input space to sample 
the next data point. We express the above mentioned optimality criterion as a cost function 
to be minimized, and the task of choosing the next sample as one of minimizing the cost 
function with respect to the input space location of the next sample point. 

Earlier work by Cohn [5] and MacKay [10] has used similar optimal experiment design 
techniques to collect data with maximum information about the target function. Our work here 
differs from theirs in two respects. First, we use a different, and perhaps more general, optimality 
criterion for evaluating solutions to an unknown target function. Specifically, our optimality crite- 
rion considers both bias and variance components in the solution's output generalization error. In 
contrast, both MacKay and Cohn use a "less complete" optimality criterion that favors solutions 
with only small variance components in model parameter space. Second, we also examine the im- 
portant sample complexity issue, i.e., does the active strategy require fewer examples than random 
sampling to approximate the target to the same degree of accuracy? After completion of this work, 
we learnt that Sollich [19] had also recently developed a similar formulation to ours, but his analysis 
is conducted in a statistical physics framework. We will review these differences in greater detail 
in a later section. 

3 The Active Learning Formulation 

In order to optimally select examples for a learning task, one should first have a clear notion 
of what an "ideal" learning goal is for the task. One can then measure an example's utility in 
terms of how well the example helps the learner achieve the goal, and devise an active sampling 
strategy that selects examples with maximum potential utility. In this section, we propose one 
such learning goal: to find an approximation function $\hat{g} \in \mathcal{F}$ that "best" estimates the unknown 
target function $g$. We then derive an example utility cost function for the goal and finally present 
a general procedure for selecting examples. 

3.1 An Optimality Criterion for Learning an Unknown Target Function 

Let $g$ be the target function that we want to estimate by means of an approximation function 
$\hat{g} \in \mathcal{F}$. If the target function $g$ were known, then one natural measure of how well (or badly) $\hat{g}$ 
approximates $g$ would be their Integrated Squared Difference (ISD) over the input space, $\mathbb{R}^d$, or 
over some appropriate region of interest: 

$$\delta(\hat{g}, g) = \int_{x \in \mathbb{R}^d} (g(x) - \hat{g}(x))^2 \, dx. \qquad (4)$$
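
For a known one-dimensional target, the integrated squared difference of Equation 4 can be approximated numerically over a region of interest. The sketch below uses a simple trapezoid rule; the function name `integrated_squared_difference`, the interval $[0, 1]$ and the example pair of functions are illustrative assumptions.

```python
import numpy as np

def integrated_squared_difference(g, g_hat, lo, hi, num=2001):
    """Approximate the integral of (g(x) - g_hat(x))^2 over [lo, hi].

    Uses the composite trapezoid rule on `num` equally spaced points.
    """
    x = np.linspace(lo, hi, num)
    sq = (g(x) - g_hat(x)) ** 2
    dx = x[1] - x[0]
    return float(dx * (sq.sum() - 0.5 * (sq[0] + sq[-1])))

# Hypothetical target x^2 approximated by the line x on [0, 1].
target = lambda x: x ** 2
approx = lambda x: x
isd = integrated_squared_difference(target, approx, 0.0, 1.0)
```

For this pair the exact value is $\int_0^1 (x^2 - x)^2\,dx = 1/30$, which the numerical estimate reproduces closely.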

In most function approximation tasks, the target $g$ is unknown, so we clearly cannot express 
the quality of a learning result in terms of $g$. We propose an alternative scheme for characterizing 
probabilistically the quality of an approximation result that takes into account only $\hat{g}$, the approximation 
function itself, and the example data points it approximates, without actually having to 
know $g$. Here, our objective notion is similar in spirit to the integrated squared difference "misfit" 
criterion described above. We elaborate on this below: 

Figure 1(a) shows two approximation functions, $\hat{g}_1$ and $\hat{g}_2$, for a set of data points, $\mathcal{D}$, from 
an unknown target function $g$. Without further knowledge of the target function, $g$, one would 
normally guess that $\hat{g}_1$ is a more probable (and hence better) hypothesis for $g$, because it oscillates 
less between the data points. This aspect of an approximation function's "goodness" has been fully 
captured by regularization, which assigns $\mathcal{P}(\hat{g}_1 \mid \mathcal{D})$ a higher likelihood value than $\mathcal{P}(\hat{g}_2 \mid \mathcal{D})$. 

Figure 1(b) shows a function, $\hat{g}$, that approximates two unknown target functions $g_1$ and $g_2$, 
sampled at $\mathcal{D}^1$ and $\mathcal{D}^2$ respectively. Notice that in this example, the approximator $\hat{g}$ fits both 
data sets exactly, so we have $\hat{g} = \arg\max_{f \in \mathcal{F}} \mathcal{P}(f \mid \mathcal{D}^1)$ and $\hat{g} = \arg\max_{f \in \mathcal{F}} \mathcal{P}(f \mid \mathcal{D}^2)$. Intuitively 
however, one might still expect the actual misfit between $g_1$ and $\hat{g}$ to be smaller than the actual 
misfit between $g_2$ and $\hat{g}$. This is because $\mathcal{D}^1$ is a more representative data sample for $g_1$ than $\mathcal{D}^2$ is 
for $g_2$, and in both systems, $\hat{g}$ is directly derived from $\mathcal{D}^1$ and $\mathcal{D}^2$ respectively. One can view this 
expected misfit notion between an unknown target $g$ and its approximation function $\hat{g}$ as a sense of 
"uncertainty" that one has in the current solution. The notion is not captured by the regularization 
framework, and as we shall see, depends instead on the distribution of training examples over the 
input space. 

Figure 1: (a) Two approximation functions, $\hat{g}_1$ and $\hat{g}_2$, for the same set of data points sampled from an 
unknown target function. $\hat{g}_1$ oscillates less between the data points, so one would normally guess that it is a more 
probable hypothesis for the unknown target function. (b) An approximation function, $\hat{g}$, for two sets of data 
points sampled from two (possibly different) unknown target functions. Although $\hat{g}$ fits both sets of data points 
exactly, one might still expect it to more closely resemble the unknown target function of the top system than of 
the bottom system. This is due to the uneven example distribution in the bottom system. 

Since our active learning task is to determine the best input space location for sampling next, a 
reasonable learning goal would be to sample at locations that minimize the expected misfit notion 
between the unknown target g and the resulting approximation g. 

3.2 Evaluating a Solution to an Unknown Target — The Expected Integrated Squared 
Difference 

We now formalize the above expected misfit notion as a mathematical functional to be minimized. 
The general idea is as follows: Let $\mathcal{F}$ be the approximation function class in our learning 
task. Suppose we treat the unknown target function $g$ as a random variable in $\mathcal{F}$; then one way of 
determining the expected misfit between the regularized solution, $\hat{g}$, and the unknown target function, 
$g$, would be to compute an expected version of some difference measure between them, such 
as their integrated squared difference, $\delta(\hat{g}, g)$ (see Equation 4). Taking into account $\mathcal{D}_n$, the $n$ data 
points seen so far, and $\mathcal{P}_{\mathcal{F}}(g)$, the prior probability of $g$ in $\mathcal{F}$, we have the following a-posteriori 
likelihood for $g$: $\mathcal{P}(g \mid \mathcal{D}_n) \propto \mathcal{P}_{\mathcal{F}}(g)\, \mathcal{P}(\mathcal{D}_n \mid g)$. The expected integrated squared difference (EISD) 
between an unknown target, $g$, and its estimate, $\hat{g}$, given $\mathcal{D}_n$, is thus: 

$$E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] = \int_{g \in \mathcal{F}} \mathcal{P}(g \mid \mathcal{D}_n)\, \delta(\hat{g}, g)\, dg \propto \int_{g \in \mathcal{F}} \mathcal{P}_{\mathcal{F}}(g)\, \mathcal{P}(\mathcal{D}_n \mid g)\, \delta(\hat{g}, g)\, dg. \qquad (5)$$

Figure 2: Top Row: A regularized solution, $\hat{g}$, for two unknown target functions, $g_1$ (left 
graph) and $g_2$ (right graph). The curve $g'$ is an alternative hypothesis for the two unknown 
target functions. There is more evidence against $g'$ being a true hypothesis for $g_1$ than for $g_2$ 
because the first system has a data point near the center of the input space where $g'$ differs 
considerably from $\hat{g}$ and the data point. Bottom Row: Graphs depicting the a-posteriori 
probability distribution of the unknown target in approximation function space for the two 
systems. Because there is more evidence against alternative hypotheses like $g'$ in the first system 
than in the second system, we get a sharper peak for the a-posteriori distribution at $\hat{g}$ in the 
first system than in the second system. 



The EISD is intuitively pleasing as an "uncertainty" measure for evaluating a solution to an 
unknown target, because its value decreases with better distributed data samples. The following 
example illustrates how the measure agrees well with "human intuition". We return to the two 
function approximation problems described in Figure 1(b). In the first system, one intuitively 
expects a smaller discrepancy between the unknown target and its approximation function than 
in the second system, even though the same regularized estimate $\hat{g}$ fits both data sets equally 
well. This is because the data samples $\mathcal{D}^1$ in the first system are more evenly (and hence better) 
distributed than the samples $\mathcal{D}^2$ in the second system. We now argue that the EISD measure in 
the first system should indeed be smaller than the EISD measure in the second system. 

Consider the same two systems in the top row of Figure 2, where $\hat{g}$ is the regularized solution 
for the two unknown target functions $g_1$ and $g_2$. The two unknown targets are sampled at $\mathcal{D}^1$ and 
$\mathcal{D}^2$ respectively, and both data sets contain the same number of data points. Consider next an 
alternate hypothesis $g'$ for $g_1$ and $g_2$ that differs slightly from the regularized solution $\hat{g}$ over some 
region of the input space. Because the first system has better distributed data points than the 
second system, there is more evidence against most alternative hypotheses like $g'$ being a viable 
solution for $g_1$ than for $g_2$. Mathematically, this means that for most alternative hypotheses like $g'$, 
the ratio $\mathcal{P}(g_1 = g' \mid \mathcal{D}^1) / \mathcal{P}(g_1 = \hat{g} \mid \mathcal{D}^1)$ is smaller than the ratio $\mathcal{P}(g_2 = g' \mid \mathcal{D}^2) / \mathcal{P}(g_2 = \hat{g} \mid \mathcal{D}^2)$. One 
can therefore expect $E_{\mathcal{F}}[\delta(\hat{g}, g_1) \mid \mathcal{D}^1] < E_{\mathcal{F}}[\delta(\hat{g}, g_2) \mid \mathcal{D}^2]$, which agrees well with "human intuition". 
The bottom row of Figure 2 depicts the difference between the two systems graphically. Because 
most alternative hypotheses are poor solutions for the first data set $\mathcal{D}^1$, the first unknown target 
$g_1$ has an a-posteriori probability distribution that is heavily weighted around $\hat{g}$ in approximation 
function space. The same is less true about the a-posteriori probability distribution for $g_2$ in the 
second system. Thus, $\hat{g}$ is a more "stable", and hence a more "certain" solution for $g_1$ than for $g_2$. 
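
The dependence of the EISD on how the samples are spread can be illustrated with a model in which it has a closed form. The sketch below assumes a quadratic polynomial class on $[0, 1]$ with a Gaussian prior on the coefficients and Gaussian noise; these are our illustrative assumptions, not the setting of the figures. The posterior is then Gaussian, the MAP estimate is its mean, and the EISD equals $\mathrm{tr}(\Sigma M)$, where $\Sigma$ is the posterior covariance and $M_{jk} = \int_0^1 x^{j+k}\,dx$. The function name `eisd_linear_gaussian` and the numerical settings are hypothetical.

```python
import numpy as np

def features(x, degree):
    """Polynomial features [1, x, ..., x^degree] for 1-D inputs."""
    return np.vander(x, degree + 1, increasing=True)

def eisd_linear_gaussian(x, degree=2, noise_var=0.01, prior_precision=1.0):
    """EISD over [0, 1] for a Bayesian polynomial model (illustrative assumption).

    With a Gaussian prior and Gaussian noise, the posterior covariance of the
    coefficients depends only on the sample locations x, not on the outputs.
    """
    Phi = features(np.asarray(x, dtype=float), degree)
    post_cov = np.linalg.inv(prior_precision * np.eye(degree + 1)
                             + Phi.T @ Phi / noise_var)
    # M[j, k] = integral over [0, 1] of x^j * x^k dx = 1 / (j + k + 1)
    idx = np.arange(degree + 1)
    M = 1.0 / (idx[:, None] + idx[None, :] + 1.0)
    return float(np.trace(post_cov @ M))

# Same number of samples, evenly spread versus clustered near the left edge.
eisd_even = eisd_linear_gaussian(np.linspace(0.0, 1.0, 5))
eisd_clustered = eisd_linear_gaussian(np.linspace(0.0, 0.2, 5))
```

With the same number of samples, the evenly spread design gives the smaller EISD, in line with the intuition argued above.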

3.3 Selecting the Next Sample Location 

Let $g$ be the unknown target function that we want to learn, $\mathcal{D}_n = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} \mid i = 1, \ldots, n\}$ 
be the set of $n$ examples seen so far, and $\hat{g}_n$ be the current regularized approximation for $g$. We 
now formalize the task of determining the best input space location to sample next. Since our 
learning goal is to minimize the expected misfit between $g$ and its regularized solution, a reasonable 
sampling strategy would be to choose the next example from the input location $x_{n+1} \in \mathbb{R}^d$ that 
minimizes the EISD between $g$ and its new estimate $\hat{g}_{n+1}$. 

How does one predict the new EISD that results from sampling the next data point at location 
$x_{n+1}$? Suppose we also know the target output value (possibly noisy), $y_{n+1}$, at $x_{n+1}$. The EISD 
between $g$ and its new estimate $\hat{g}_{n+1}$ would then be $E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]$, where $\hat{g}_{n+1}$ 
can be recovered from $\mathcal{D}_n \cup (x_{n+1}, y_{n+1})$ via regularization. In reality, we do not know $y_{n+1}$, but 
we can derive its conditional probability distribution from $\mathcal{D}_n$, the data samples seen so far. Once 
again, let $\mathcal{F}$ be the approximation function class for our learning task and $\mathcal{P}_{\mathcal{F}}(f)$ be the prior 
probability of $f$ in $\mathcal{F}$; then: 



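$$\mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n) \propto \int_{f \in \mathcal{F}} \mathcal{P}(\mathcal{D}_n \cup (x_{n+1}, y_{n+1}) \mid f)\, \mathcal{P}_{\mathcal{F}}(f)\, df. \qquad (6)$$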

Because $y_{n+1}$ is a random variable and not a fixed value as we had assumed earlier, this leads to 
the following expected value for the new EISD, if we sample our next data point at $x_{n+1}$: 

$$U(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1}) = \int_{-\infty}^{\infty} \mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)\, E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]\, dy_{n+1}. \qquad (7)$$

Notice from Equation 5 that $E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]$ in the above expression is actually 
independent of the unknown target function $g$, and so $U(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ (henceforth referred to as 
the total output uncertainty) is fully computable from available information in the learning model. 
Clearly, the optimal input location to sample next is the location that minimizes $U(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$, 
i.e.: 

$$\hat{x}_{n+1} = \arg\min_{x_{n+1}} U(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1}). \qquad (8)$$



3.4 Summary of the Active Learning Procedure 

We summarize the key steps involved in our active learning strategy for finding the optimal 
next sample location: 

1. Compute $\mathcal{P}(g \mid \mathcal{D}_n)$. This is the a-posteriori likelihood of the different functions $g$ given $\mathcal{D}_n$, 
the $n$ data points seen so far. 

2. Assume a new point $x_{n+1}$ to sample. 

3. Assume a value $y_{n+1}$ for this $x_{n+1}$. One can compute $\mathcal{P}(g \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))$ and hence the 
expected integrated squared difference (EISD) between the target and its new estimate $\hat{g}_{n+1}$. 
This is given by $E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]$ (see Equation 5). 

4. At the assumed $x_{n+1}$, $y_{n+1}$ has a probability distribution given by Equation 6. Averaging 
the resulting EISD over all values of $y_{n+1}$, we obtain the total output uncertainty for $x_{n+1}$, given by 
$U(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ in Equation 7. 

5. Sample at the input location $x_{n+1}$ that minimizes the total output uncertainty cost function 
$U(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$. 
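
The five steps can be sketched for a small discretized problem. The example below is a hedged illustration, not the authors' implementation: it assumes a one-dimensional class of threshold functions $g_a(x) = 1$ if $x \geq a$ and $0$ otherwise, a uniform prior over a grid of thresholds, noiseless data, and an estimate taken to be the posterior-mean threshold. All function and variable names are hypothetical.

```python
import numpy as np

# Discretised function class: g_a(x) = 1 if x >= a else 0, a on a grid in [0, 1].
thresholds = np.linspace(0.0, 1.0, 1001)

def posterior(data):
    """Step 1: uniform prior restricted to thresholds consistent with noiseless data."""
    p = np.ones_like(thresholds)
    for x, y in data:
        consistent = (x >= thresholds) if y == 1 else (x < thresholds)
        p = p * consistent
    return p / p.sum()

def eisd(p):
    """Step 3: expected ISD between the target and the estimate.

    The ISD between two unit steps on [0, 1] is |a - a_hat|. The estimate
    a_hat is taken to be the posterior mean (an assumption of this sketch).
    """
    a_hat = float(np.sum(p * thresholds))
    return float(np.sum(p * np.abs(thresholds - a_hat)))

def total_output_uncertainty(data, x):
    """Steps 2 to 4: average the new EISD over the predictive distribution of y."""
    p = posterior(data)
    p_y1 = float(np.sum(p[x >= thresholds]))   # P(y = 1 | data, x)
    p_y0 = float(np.sum(p[x < thresholds]))    # P(y = 0 | data, x)
    total = 0.0
    for y, p_y in ((1, p_y1), (0, p_y0)):
        if p_y > 0.0:                          # skip outcomes that cannot occur
            total += p_y * eisd(posterior(data + [(x, y)]))
    return total

def next_sample(data, candidates):
    """Step 5: pick the candidate location with the smallest total uncertainty."""
    costs = [total_output_uncertainty(data, x) for x in candidates]
    return float(candidates[int(np.argmin(costs))])

# Two labelled points bracket the unknown threshold between 0.2 and 0.6.
data = [(0.2, 0), (0.6, 1)]
candidates = np.linspace(0.0, 1.0, 101)
x_next = next_sample(data, candidates)
```

Under these assumptions the selected location is the midpoint of the interval that still contains the threshold, so the procedure behaves like a binary search.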

Some final remarks about our example selection strategy: Intuitively, a reasonable selection 
criterion should choose new examples that provide dense information about the target function g. 
Furthermore, the choice should also take into account the learner's current state, namely $\mathcal{D}_n$ and 
$\hat{g}_n$, so as to maximize the net amount of information gained. Our scheme treats an approximation 
function's expected misfit with respect to the unknown target g (i.e. their EISD) as a measure of 
uncertainty in the current solution. It selects new examples, based on the data that it has already 
seen, to minimize the expected value of the resulting EISD measure. In doing so, it essentially 
maximizes the net amount of information gained with each new example. 

Our main results in this active learning formulation are: 

1. a cost function that captures the expected misfit optimality criterion (Equation 5) for evalu- 
ating the solution to an unknown target function, and 

2. a formal specification for the task of selecting new training examples with maximum potential 
utility (Equation 8). 

The developed framework, and the associated observations may, in themselves, be interesting 
from a theoretical standpoint, but in practice, another fundamental concern must also be addressed 
— the computational complexity issue. Both Equations 5 and 8, though theoretically computable 
from available information in the learning model, are clearly intractable in their current form. Nev- 
ertheless, we maintain the formulation still serves as a possible "optimal" benchmark for evaluating 
other active example selection schemes. Later in this paper, we shall consider a reduced version 
of the original function approximation based active learning formulation that essentially hunts for 
new data where approximation "error bars" are high. We also show how such a scheme, with 
minor modifications, leads to a "boot-strap" example selection strategy we have adopted to useful 
advantage in an object and pattern class detection approach that we have developed. 

3.5 Previous Frameworks Revisited 

At this point, we are in a position to make concrete the differences between the technique developed 
here and those adopted by MacKay [10] and Sollich [19]. Recall that our active formulation 
requires the computation of two expectations. One is over the a-posteriori distribution over the 
function space $\mathcal{F}$; the other is over the a-posteriori distribution over the space of values of $y_{n+1}$ one would 
expect given the data and a proposed sample location $x_{n+1}$. Specifically, 

$$U(x_{n+1}) = E_{\mathcal{P}(y_{n+1} \mid \mathcal{D}_n, x_{n+1})}\, E_{\mathcal{P}(g \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))}\, \delta(\hat{g}, g).$$

Here $g$ ranges over all the functions in $\mathcal{F}$ and $\hat{g}$ is the MAP solution that the learner would use 
in practice in such a Bayesian setting. 

In contrast, MacKay uses the following criterion: 

$$U = E_{\mathcal{P}(y_{n+1} \mid \mathcal{D}_n, x_{n+1})}\, E_{\mathcal{P}(\vec{a} \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))}\, \|\vec{a} - \vec{a}^*\|^2.$$

He assumes that $\mathcal{F}$ is a parameterized family with parameters denoted by $\vec{a}$. Correspondingly, 
$\vec{a}^*$ is the mean parameter value (averaged over the a-posteriori density over parameter space, i.e., 
$\vec{a}^* = E_{\mathcal{P}(\vec{a} \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))}(\vec{a})$). Notice that his criterion operates in the parameter space rather 
than the true function space. If closeness in parameter space is linearly proportional to closeness 
in function space, then the criterion would be equivalent to the following (where $\bar{g}$ is the mean over 
the function space): 

$$E_{\mathcal{P}(y_{n+1} \mid \mathcal{D}_n, x_{n+1})}\, E_{\mathcal{P}(g \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))}\, \delta(g, \bar{g}).$$

Finally, Sollich uses the following criterion: 

$$U = E_{\mathcal{P}(y_{n+1} \mid \mathcal{D}_n, x_{n+1})}\, E_{\mathcal{P}(g \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))}\, E_{\mathcal{P}(h \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1}))}\, \delta(g, h).$$

He distinguishes the space of target functions $\mathcal{F}$ from the space of hypothesis functions (say 
$\mathcal{H}$). One could then compute the a-posteriori distributions over both the target space and the 
hypothesis space given a set of data. These a-posteriori distributions are used to compute three 
averages in the sense above. 

Having now reproduced each of the above frameworks in a common, consistent notation, we 
make the following observations. 

First, it is clear that the three frameworks differ slightly. MacKay's criterion, which computes 
"spread" around the mean of the a-posteriori distribution in parameter space, is the variance of the 
a-posteriori distribution in a strictly correct sense. He ignores the bias. Sollich's criterion (assuming 
$\mathcal{F} = \mathcal{H}$) reduces to ours if one assumes that $\mathcal{P}(h \mid \mathcal{D})$ is a Kronecker delta centered on $\hat{g}$, the MAP 
solution. The two inner expectations then reduce essentially to just one non-trivial one; this condition is 
referred to as "deterministic" learning, as opposed to the stochastic "Gibbs" learning framework 
seemingly consistent with the statistical mechanics approach. We have preferred to use the MAP 
solution in our framework since, in practice, that is what is used in most learning systems. 

Finally, like Sollich, we use the output generalization error ($\delta$) in our approach. In reality, one 
is interested in choosing examples to reduce the total generalization error. Using a metric based on 
the parameter space is only an indirect measure of this generalization error. In the case of function 
classes sufficiently non-linear in their parameters, this is likely to give suboptimal performance. 

We now return to the main subject of investigation, i.e., sample complexity, analytical tractabil- 
ity, and practical utility. 

4 Comparing Sample Complexity 

To demonstrate the usefulness of the above active learning procedure, we show analytically and 
empirically that the active strategy learns target functions with fewer data examples than random 
sampling for three specific approximation function classes: (1) unit step functions, (2) polynomial 
approximators and (3) Gaussian radial basis function networks. For all three function classes, one 
can derive exact analytic data selection algorithms following the key steps outlined in Section 3.4.

4.1 Unit Step Functions

We first consider the following simple class of one-dimensional unit-step functions described by
a single parameter $a$ which takes values in $[0, 1]$. Let us denote the unit-step function by:

$$u(x - a) = \begin{cases} 1 & \text{if } x > a \\ 0 & \text{otherwise} \end{cases}$$

The target and approximation function class for this problem is given by:

$$\mathcal{F} = \{u(x - a) \mid 0 \le a \le 1\}$$

Assuming $a$ has an a-priori uniform distribution on $[0, 1]$, we obtain the following prior distribution
on the approximation function class:

$$\mathcal{P}_{\mathcal{F}}(g = u(x - a)) = \begin{cases} 1 & \text{if } 0 \le a \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Suppose we have a noiseless data set, $\mathcal{D}_n = \{(x_i, y_i);\ i = 1, \ldots, n\}$, consistent with some unknown
target function $g = u(x - a)$ that the learner has to approximate. We want to find the best input
location to sample next, $x \in [0, 1]$, that would provide us with maximal information. Let $x_R$ be the
rightmost point in $\mathcal{D}_n$ whose $y$ value is 0, i.e., $x_R = \max_{i=1,\ldots,n} \{x_i \mid y_i = 0\}$ (see Figure 3). Similarly,
let $x_L = \min_{i=1,\ldots,n} \{x_i \mid y_i = 1\}$ and $w = x_L - x_R$. Following the general procedure outlined in
Section 3.4, we go through the following steps:



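1. Derive $\mathcal{P}(g \mid \mathcal{D}_n)$. One can show that:

$$\mathcal{P}(g = u(x - a) \mid \mathcal{D}_n) = \begin{cases} 1/w & \text{if } a \in [x_R, x_L] \\ 0 & \text{otherwise} \end{cases}$$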

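2. Suppose we sample next at a particular $x \in [0, 1]$. We would obtain $y$ with the distribution:

$$\mathcal{P}(y = 0 \mid \mathcal{D}_n, x) = \begin{cases} \dfrac{x_L - x}{x_L - x_R} = \dfrac{x_L - x}{w} & \text{if } x \in [x_R, x_L] \\ 1 & \text{if } x < x_R \\ 0 & \text{otherwise} \end{cases}$$

$$\mathcal{P}(y = 1 \mid \mathcal{D}_n, x) = \begin{cases} \dfrac{x - x_R}{x_L - x_R} = \dfrac{x - x_R}{w} & \text{if } x \in [x_R, x_L] \\ 1 & \text{if } x > x_L \\ 0 & \text{otherwise} \end{cases}$$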



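3. For a particular $y$, the new data set would be $\mathcal{D}_{n+1} = \mathcal{D}_n \cup (x, y)$ and the corresponding EISD
can be easily obtained using the distribution $\mathcal{P}(g \mid \mathcal{D}_{n+1})$. Averaging this over $\mathcal{P}(y \mid \mathcal{D}_n, x)$ as
in step 4 of the general procedure, we obtain:

$$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x) = \begin{cases} \dfrac{w^2}{12} & \text{if } x < x_R \text{ or } x > x_L \\ \dfrac{1}{12w}\left((x_L - x)^3 + (x - x_R)^3\right) & \text{otherwise} \end{cases}$$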



4. Clearly, the new input location that minimizes the total output uncertainty measure,
$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x)$, is the midpoint between $x_L$ and $x_R$:

$$\hat{x}_{n+1} = \arg\min_{x \in [0,1]} \mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x) = \frac{x_L + x_R}{2}.$$



Thus, by applying the general procedure to this trivial case of one-dimensional unit-step functions,
we get the familiar binary search learning algorithm that queries the midpoint of $x_R$ and $x_L$.
For this function class, one can show analytically in PAC style [23] that this active data sampling
strategy takes fewer examples to learn an unknown target function to a given level of total output
uncertainty than randomly drawing examples according to a uniform distribution in $x$.

Theorem 1 Suppose we want to collect examples so that we are guaranteed with high probability
(i.e., probability $> 1 - \delta$) that the total output uncertainty is less than $\epsilon$. Then a passive learner
would require at least [...] $\ln(1/\delta)$ examples while the active strategy described earlier would require
at most $(1/2)\ln(1/12\epsilon)$ examples.
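
The contrast between the two strategies can be tried out with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the bracket $[x_R, x_L]$ starts at $[0, 1]$ before any data are seen, uses the $w^2/12$ uncertainty from step 3 as the stopping rule, and the helper names (`count_queries`, `midpoint`, `make_uniform`) are hypothetical.

```python
import random

def label(x, a):
    """Noiseless unit-step target u(x - a): 1 if x > a, else 0."""
    return 1 if x > a else 0

def update(x_r, x_l, x, y):
    """Tighten the bracket [x_r, x_l] that must contain a after observing (x, y)."""
    if y == 0:
        return max(x_r, x), x_l   # a lies at or to the right of x
    return x_r, min(x_l, x)       # a lies to the left of x

def uncertainty(x_r, x_l):
    """Total output uncertainty w^2 / 12 for bracket width w = x_l - x_r."""
    w = x_l - x_r
    return w * w / 12.0

def count_queries(a, eps, choose):
    """Query until the uncertainty drops below eps; `choose` picks the next x."""
    x_r, x_l, n = 0.0, 1.0, 0     # assumed starting bracket: no data yet
    while uncertainty(x_r, x_l) >= eps:
        x = choose(x_r, x_l)
        x_r, x_l = update(x_r, x_l, x, label(x, a))
        n += 1
    return n

def midpoint(x_r, x_l):
    """Active rule from step 4: query the middle of the bracket."""
    return 0.5 * (x_r + x_l)

def make_uniform(seed):
    """Passive rule: ignore the bracket and draw x uniformly from [0, 1)."""
    rng = random.Random(seed)
    return lambda x_r, x_l: rng.random()

if __name__ == "__main__":
    a, eps = 0.3, 1e-4
    print("active queries: ", count_queries(a, eps, midpoint))
    print("passive queries:", count_queries(a, eps, make_uniform(0)))
```

Each midpoint query halves $w$, so after $n$ queries the uncertainty is $4^{-n}/12$; with $\epsilon = 10^{-4}$ the active rule therefore stops after 5 queries regardless of $a$. The passive count varies with the random seed.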

4.2 Polynomial Approximators 

We consider next a univariate polynomial target and approximation function class with maximum
degree $K$, i.e.:

$$\mathcal{F} = \Big\{ g(x, \mathbf{a}) = g(x, a_0, \ldots, a_K) = \sum_{i=0}^{K} a_i x^i \Big\}.$$

The model parameters to be learnt are $\mathbf{a} = [a_0\ a_1\ \cdots\ a_K]^T$ and $x$ is the input variable. We obtain
a prior distribution for $\mathcal{F}$ by assuming a zero-mean Gaussian distribution with covariance $\Sigma_{\mathcal{F}}$ on
the model parameters $\mathbf{a}$:

$$\mathcal{P}_{\mathcal{F}}(g(\cdot, \mathbf{a})) = \mathcal{P}_{\mathcal{F}}(\mathbf{a}) = \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_{\mathcal{F}}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{a}^T \Sigma_{\mathcal{F}}^{-1} \mathbf{a}\right). \quad (9)$$

Our task is to approximate an unknown target function $g \in \mathcal{F}$ within the input range $[x_{LO}, x_{HI}]$
on the basis of sampled data. Let $\mathcal{D}_n = \{(x_i, y_i = g(x_i) + \eta) \mid i = 1, \ldots, n\}$ be a noisy data sample
from the unknown target in the input range $[x_{LO}, x_{HI}]$, where $\eta$ is an additive zero-mean Gaussian
noise term with variance $\sigma_s^2$. We compare two different ways of selecting the next data point: (1)
sampling the function at a random point $x$ according to a uniform distribution in $[x_{LO}, x_{HI}]$ (i.e.,
passive learning), and (2) using our active learning framework to derive an exact algorithm for
determining the next sampled point.

4.2.1 The Active Strategy 

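Here, we go through the general active learning procedure outlined in Section 3.4 to derive an
exact expression for $\hat{x}_{n+1}$, the next query point. We summarize the key derivation steps below:

1. Let $\bar{\mathbf{x}}_i = [1\ x_i\ x_i^2\ \ldots\ x_i^K]^T$ be a power vector of the $i$-th data sample's input value. One can
show (see Appendix A.1.1) that the a-posteriori approximation function class distribution,
$\mathcal{P}(\mathbf{a} \mid \mathcal{D}_n)$, is a multivariate Gaussian centered at $\hat{\mathbf{a}}$ with covariance $\Sigma_n$, where:

$$\hat{\mathbf{a}} = \Sigma_n \left( \frac{1}{\sigma_s^2} \sum_{i=1}^{n} \bar{\mathbf{x}}_i y_i \right)$$

and:

$$\Sigma_n^{-1} = \Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{i=1}^{n} \bar{\mathbf{x}}_i \bar{\mathbf{x}}_i^T. \quad (10)$$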

2. Deriving the total output uncertainty expression $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ requires several steps (see
Appendix A.1.2 and A.1.3). Taking advantage of the Gaussian distribution on both the
parameters $\mathbf{a}$ and the noise term, we eventually get:

$$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1}) = |\Sigma_{n+1} A| \propto |\Sigma_{n+1}|, \quad (11)$$

where $A$ is a constant $(K + 1) \times (K + 1)$ matrix of numbers whose $(i, j)$ element is:

$$A_{i,j} = \int_{x_{LO}}^{x_{HI}} t^{\,i+j-2}\, dt.$$

$\Sigma_{n+1}$ has the same form as $\Sigma_n$ and depends on the previous data, the priors, noise and the
next sample location $x_{n+1}$. When minimized over $x_{n+1}$, we get $\hat{x}_{n+1}$ as the maximum utility
location where the active learner should next sample the unknown target function.
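
The two derivation steps above can be sketched numerically. The code below is an illustration under stated assumptions rather than the procedure used in the paper: it replaces the exact minimization over $x_{n+1}$ with a search over a finite candidate grid, and the function names (`power_vector`, `posterior_precision`, `next_query`, `design_sequence`) are hypothetical. It relies only on Equation 10 for the form of the posterior covariance and on Equation 11 for the determinant criterion.

```python
import numpy as np

def power_vector(x, K):
    """Power vector [1, x, x^2, ..., x^K] of a scalar input x."""
    return np.array([x ** k for k in range(K + 1)], dtype=float)

def posterior_precision(xs, prior_cov, noise_var):
    """Inverse posterior covariance in the form of Equation 10."""
    K = prior_cov.shape[0] - 1
    precision = np.linalg.inv(prior_cov)
    for x in xs:
        v = power_vector(x, K)
        precision = precision + np.outer(v, v) / noise_var
    return precision

def next_query(xs, prior_cov, noise_var, candidates):
    """Candidate x minimizing |Sigma_{n+1}|, i.e. maximizing log|Sigma_{n+1}^{-1}|."""
    K = prior_cov.shape[0] - 1
    precision = posterior_precision(xs, prior_cov, noise_var)
    best_x, best_logdet = None, -np.inf
    for x in candidates:
        v = power_vector(x, K)
        _, logdet = np.linalg.slogdet(precision + np.outer(v, v) / noise_var)
        if logdet > best_logdet:
            best_x, best_logdet = float(x), logdet
    return best_x

def design_sequence(n, prior_cov, noise_var, candidates):
    """Greedy sequence of n query locations; note that no y values are used."""
    xs = []
    for _ in range(n):
        xs.append(next_query(xs, prior_cov, noise_var, candidates))
    return xs

if __name__ == "__main__":
    grid = np.linspace(-5.0, 5.0, 101)          # assumed candidate grid
    print(design_sequence(6, np.eye(3), 0.1, grid))
```

Working with the log-determinant of the precision matrix avoids forming $\Sigma_{n+1}$ explicitly. By the matrix determinant lemma, adding one rank-one term multiplies the determinant of the precision by $1 + \bar{\mathbf{x}}^T \Sigma_n \bar{\mathbf{x}} / \sigma_s^2$, so each step only needs the previously sampled inputs.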

4.2.2 Simulations — Error Rate versus Number of Examples 

We perform some simulations to compare the active strategy's sample complexity with that
of a passive learner which receives uniformly distributed random training examples on the input
domain $[x_{LO}, x_{HI}]$. In this experiment, we investigate whether our active example selection strategy
learns an unknown target to a smaller average error rate than the passive strategy for the same
number of data samples. The experiment proceeds as follows:

We randomly generate 1000 target polynomial functions using a fixed Gaussian prior on the
model parameters $\mathbf{a} = [a_0\ a_1\ \cdots\ a_K]^T$. For each target polynomial, we collect data sets with noisy
$y$ values ranging from 3 to 50 samples in size, using both the active and passive sampling strategies.
We then assume the same Gaussian priors on the approximation function class to obtain a
regularized estimate of the target polynomial for each data set. Because we know the actual target
polynomial for each data set, one can compute the actual integrated squared difference between
the target and its estimate as an approximation error measure. We compare the two sampling
strategies by separately averaging their approximation error rates for each data sample size over
the 1000 different target polynomials.
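
The two quantities used in this experiment, the regularized estimate and the integrated squared difference, both have closed forms for polynomials. The sketch below is a hedged illustration, not the original simulation code: `map_estimate` follows the form of Equation 10, `gram_matrix` uses 0-based coefficient indices, and all function names are hypothetical.

```python
import numpy as np

def gram_matrix(K, lo, hi):
    """G[i, j] = integral of t^(i+j) over [lo, hi], with 0-based i, j."""
    p = np.add.outer(np.arange(K + 1), np.arange(K + 1)) + 1
    return (hi ** p - lo ** p) / p

def map_estimate(xs, ys, prior_cov, noise_var):
    """Regularized (MAP) coefficient estimate for a zero-mean Gaussian prior."""
    K = prior_cov.shape[0] - 1
    X = np.vander(np.asarray(xs, dtype=float), K + 1, increasing=True)
    precision = np.linalg.inv(prior_cov) + X.T @ X / noise_var
    rhs = X.T @ np.asarray(ys, dtype=float) / noise_var
    return np.linalg.solve(precision, rhs)

def integrated_squared_difference(a_true, a_hat, lo, hi):
    """Integral over [lo, hi] of (g(x, a_true) - g(x, a_hat))^2 dx."""
    d = np.asarray(a_true, dtype=float) - np.asarray(a_hat, dtype=float)
    return float(d @ gram_matrix(len(d) - 1, lo, hi) @ d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, lo, hi, noise_sd = 3, -5.0, 5.0, 0.1        # illustrative settings
    prior_cov = np.diag([0.9 ** (2 * (j + 1)) for j in range(K + 1)])
    a_true = rng.multivariate_normal(np.zeros(K + 1), prior_cov)
    xs = rng.uniform(lo, hi, size=10)              # passive (random) sampling
    X = np.vander(xs, K + 1, increasing=True)
    ys = X @ a_true + rng.normal(0.0, noise_sd, size=len(xs))
    a_hat = map_estimate(xs, ys, prior_cov, noise_sd ** 2)
    print(integrated_squared_difference(a_true, a_hat, lo, hi))
```

Averaging this error over many sampled targets, once for randomly drawn inputs and once for actively chosen inputs, gives the kind of comparison described above.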

In our simulations, we use polynomials of maximum degree $K = 9$, distributed according to the
following independent Gaussian priors on model parameters: for each $a_j$ in $\mathbf{a} = [a_0\ a_1\ \ldots\ a_9]^T$, we
have:

$$\mathcal{P}(a_j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left(-\frac{a_j^2}{2\sigma_j^2}\right)$$

where $\sigma_j = 0.9^{j+1}$. In other words, $\Sigma_{\mathcal{F}}$ of Equation 9 is a $10 \times 10$ diagonal covariance matrix such
that:

$$\Sigma_{\mathcal{F}}(i, j) = \begin{cases} \sigma_{i-1}^2 = 0.9^{2i} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \quad (12)$$

Qualitatively, our priors favor smooth functions by assigning higher probabilities to polynomials
with smaller coefficients, especially for higher powers of $x$. We also fix the input domain to be
$[x_{LO}, x_{HI}] = [-5, 5]$.

Figure 4 shows the average integrated squared difference between the 1000 randomly generated
target polynomials and their regularized estimates for different data sample sizes. We repeated
the same simulations three times, each with a different output noise level in the data samples:
$\sigma_s = 0.1$, 1.0 and 5.0. Notice that the active strategy has a lower average error rate than the passive
strategy, particularly for smaller data samples. From this experiment, one can conclude empirically
that our active sampling strategy learns with fewer data samples than random sampling even when
dealing with noisy data.

4.2.3 Simulations — Incorrect Priors 

So far we have assumed that the approximation function class $\mathcal{F}$ that the Bayesian learner uses
is identical to the target class: in other words, the learner has correct prior knowledge about the
target. This is highly questionable in a real world situation where little is known about the target
function and its properties and the learner is most likely to have an inaccurate prior model. We
now investigate how the active learning strategy behaves if this is the case. Specifically, we
consider the following three scenarios:

In the first case, the active learner assumes a higher polynomial degree with similar but slightly
larger Gaussian variances than the true target priors. We use a 9th degree (i.e., $K = 9$) polynomial
function class with Gaussian priors $\sigma_j = 0.9^{j+1}$ to approximate an unknown 8th degree (i.e., $K = 8$)
target polynomial with Gaussian priors $\sigma_j = 0.8^{j+1}$. Qualitatively, the approximation function class
is more complex and favors smooth estimates less strongly than the target class.

The second case deals with the exact opposite scenario. The active learner uses a lower degree
polynomial with similar but slightly smaller Gaussian variances ($K = 7$ and $\sigma_j = 0.7^{j+1}$) to
approximate an unknown 8th degree (i.e., $K = 8$) target with Gaussian priors $\sigma_j = 0.8^{j+1}$. Here,
the approximation function class is less complex and favors smooth estimates more strongly than
the target class.

Figure 4: Comparing active and passive learning average error rates at different output noise levels for
polynomials of maximum degree $K = 9$. We use the same priors on the target and approximation function classes.
The three graphs above plot log error rates against number of samples. See text for detailed explanation. The dark
and light curves are the active and passive learning error rates respectively.

Figure 5: Comparing active and passive learning average error rates for slightly different priors between the
target and approximation function classes. Top: Results for the first case. The approximation function class
uses a higher degree polynomial with larger Gaussian variances on its coefficients ($K = 9$ and $\sigma_j = 0.9^{j+1}$) versus
($K = 8$ and $\sigma_j = 0.8^{j+1}$). Middle: The approximation function class uses a lower degree polynomial with smaller
Gaussian variances on its coefficients ($K = 7$ and $\sigma_j = 0.7^{j+1}$) versus ($K = 8$ and $\sigma_j = 0.8^{j+1}$). Bottom: The
approximation and target polynomial function classes have smoothness priors that differ in form. In all three
cases, the active learning strategy still results in lower approximation error rates than random sampling for the
same number of data points.

In the third case, we consider a polynomial approximation function class $\mathcal{F}$ whose prior
distribution has a different form from that of the target class. Let $p \in \mathcal{F}$ be a polynomial in the
approximation function class. One can quantify the overall "smoothness" of $p$ by integrating its
squared first derivative over the input domain $[x_{LO}, x_{HI}]$:

$$Q(p(\cdot, \mathbf{a})) = \int_{x_{LO}}^{x_{HI}} \left[ \frac{dp(x, \mathbf{a})}{dx} \right]^2 dx. \quad (13)$$

The "smoothness" measure above leads to a convenient prior distribution on $\mathcal{F}$ that favors smoothly
varying functions:

$$\mathcal{P}_{\mathcal{F}}(p) \propto \exp\left(-Q(p(\cdot, \mathbf{a}))\right) \exp\left(-\frac{a_0^2}{2\sigma_0^2}\right).$$

Here, $a_0$ is the constant term in the polynomial $p$, whose coefficients are $\mathbf{a} = [a_0\ a_1\ \ldots\ a_K]^T$.
Although $a_0$ does not affect the "smoothness" measure in Equation 13, we impose on it a Gaussian
distribution with variance $\sigma_0^2$ so $\mathcal{P}_{\mathcal{F}}(p)$ integrates to 1 over all polynomials in $\mathcal{F}$ like a true
probability density function. One can show (see Appendix A.1.4 for detailed derivation) that $\mathcal{P}_{\mathcal{F}}(p)$
has the following general form similar to Equation 9, the priors on polynomials with independent
Gaussian distributed coefficients:

$$\mathcal{P}_{\mathcal{F}}(p(\cdot, \mathbf{a})) = \mathcal{P}_{\mathcal{F}}(\mathbf{a}) = \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_{\mathcal{F}}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{a}^T \Sigma_{\mathcal{F}}^{-1} \mathbf{a}\right).$$

The new covariance term $\Sigma_{\mathcal{F}}$ is as given below:

$$\Sigma_{\mathcal{F}}^{-1}(i, j) = \begin{cases} 1/\sigma_0^2 & \text{if } i = j = 1 \\ 2\,\dfrac{(i-1)(j-1)}{i+j-3} \left( x_{HI}^{\,i+j-3} - x_{LO}^{\,i+j-3} \right) & \text{if } 2 \le i \le K+1 \text{ and } 2 \le j \le K+1 \\ 0 & \text{otherwise} \end{cases} \quad (14)$$

For this third case, we use an 8th degree (i.e., $K = 8$) polynomial function class with $\sigma_0 = 0.8$
in its "smoothness" prior to approximate a target polynomial class of similar degree with Gaussian
priors $\sigma_j = 0.8^{j+1}$. Although the approximation and target function classes have prior distributions
that differ somewhat in form, both priors are qualitatively similar in that they favor smoothly
varying polynomials.
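
The link between the smoothness measure of Equation 13 and the inverse covariance of Equation 14 can be checked numerically. The sketch below is an illustration under stated assumptions: it uses 0-based coefficient indices, so the exponent $i + j - 3$ of Equation 14 becomes `i + j - 1`, and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def smoothness_precision(K, lo, hi, sigma0):
    """Inverse covariance of Equation 14, written with 0-based indices."""
    prec = np.zeros((K + 1, K + 1))
    prec[0, 0] = 1.0 / sigma0 ** 2
    for i in range(1, K + 1):
        for j in range(1, K + 1):
            e = i + j - 1   # equals i + j - 3 in the 1-based indexing of Equation 14
            prec[i, j] = 2.0 * i * j / e * (hi ** e - lo ** e)
    return prec

def smoothness(a, lo, hi):
    """Q(p): exact integral of the squared first derivative over [lo, hi]."""
    deriv = P.polyder(np.asarray(a, dtype=float))
    antideriv = P.polyint(P.polymul(deriv, deriv))
    return float(P.polyval(hi, antideriv) - P.polyval(lo, antideriv))

def log_prior_unnormalized(a, lo, hi, sigma0):
    """-(1/2) a^T Sigma^{-1} a, the exponent of the Gaussian form of the prior."""
    a = np.asarray(a, dtype=float)
    prec = smoothness_precision(len(a) - 1, lo, hi, sigma0)
    return -0.5 * float(a @ prec @ a)

if __name__ == "__main__":
    a, lo, hi, sigma0 = [0.5, 1.0, -0.3], -1.0, 2.0, 0.8
    direct = -smoothness(a, lo, hi) - a[0] ** 2 / (2.0 * sigma0 ** 2)
    print(direct, log_prior_unnormalized(a, lo, hi, sigma0))
```

Both printed numbers are the same exponent computed two ways: once from $-Q(p) - a_0^2 / (2\sigma_0^2)$ and once from the quadratic form in the coefficients.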

For all three cases of slightly incorrect priors described above, we compare our active learner's
sample complexity with that of a passive learner which receives random samples according to a
uniform distribution on $[x_{LO}, x_{HI}]$. We repeat the active versus passive function learning simulations
performed earlier by generating 1000 target polynomials, collecting noisy data samples ($\sigma_s = 0.5$),
computing regularized estimates, averaging and comparing their approximation errors in the same
way as before. Figure 5 plots the resulting average integrated squared difference error rates over a
range of data sample sizes for all three cases. Despite the incorrect priors, we see that the active
learner still outperforms the passive strategy.

4.2.4 Distribution of Data Points

Notice from Equations 10, 11, 12 and 14 that the total output uncertainty measure $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$
for polynomial approximators (i.e., Equation 11) does not depend on the previous $y$ data values
actually observed, but only on the previous input locations sampled. In other words, the previously
observed $y$ data values do not affect $\hat{x}_{n+1}$, the optimal location to sample next. One can show
that this behavior is common to all approximation function classes that are linear in their model
parameters [10] [19].

Given a polynomial approximation function class of maximum degree K , one can thus pre- 
compute the sequence of input locations that our active learner will sample to gain maximum 
information about the unknown target. There are two sampling trends that are noteworthy here. 
First, the active strategy does not simply sample the input domain on a uniform grid. Instead, it 
chooses to cluster its data samples typically around $K + 1$ locations. Figure 6 shows the first 50
input locations the active learner selects as $K$ varies from 5 to 9, for a fixed data noise level of
$\sigma_s = 0.1$. One possible explanation for this clustering behavior is that it takes only $K + 1$ data
points to recover a degree-$K$ target polynomial in the absence of noise.

Second, as the data noise level $\sigma_s$ increases, although the number of data clusters remains fixed,
the clusters tend to be distributed away from the input origin. Figure 7 displays the first 50 input
locations the active strategy selects for a 9th degree polynomial approximation function class, as $\sigma_s$
increases from 0.1 to 5.0. One can explain the observed behavior as follows: For higher noise levels,
there is less pressure on the active learner to fit the data closely. Consequently, the prior assumption 
favoring polynomials with small coefficients dominates. For such "lower order" polynomials, one 
gets better "leverage" from data by sampling away from the origin. In the extreme case of linear 
regression, one gets best "leverage" by sampling data at the extreme ends of the input space. 

4.3 Gaussian Radial Basis Functions

Our final example looks at an approximation function class $\mathcal{F}$ of $d$-dimensional Gaussian radial
basis functions with $K$ fixed centers. Let $G_i$ be the $i$-th basis function with a fixed center $\mathbf{c}_i$ and
a fixed covariance $S_i$. The model parameters to be learnt are the weight coefficients denoted by
$\mathbf{a} = [a_1\ a_2\ \cdots\ a_K]^T$. An arbitrary function $r \in \mathcal{F}$ in this class can thus be represented as:

$$r(\mathbf{x}, \mathbf{a}) = \sum_{i=1}^{K} a_i G_i(\mathbf{x})$$

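We impose a prior $\mathcal{P}_{\mathcal{F}}(\cdot)$ on the approximation function class $\mathcal{F}$ by putting a zero-centered Gaussian
distribution with covariance $\Sigma_{\mathcal{F}}$ on the model parameters $\mathbf{a}$. Thus, for an arbitrary function $r(\cdot, \mathbf{a})$:

$$\mathcal{P}_{\mathcal{F}}(r(\cdot, \mathbf{a})) = \mathcal{P}_{\mathcal{F}}(\mathbf{a}) = \frac{1}{(2\pi)^{K/2} |\Sigma_{\mathcal{F}}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{a}^T \Sigma_{\mathcal{F}}^{-1} \mathbf{a}\right).$$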

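Lastly, the learner has access to noisy data of the form $\mathcal{D}_n = \{(\mathbf{x}_i, y_i = g(\mathbf{x}_i) + \eta) : i = 1, \ldots, n\}$,
where $g$ is an unknown target function and $\eta$ is a zero-mean additive Gaussian noise term with
variance $\sigma_s^2$. Thus for every candidate approximation function $r(\cdot, \mathbf{a}) \in \mathcal{F}$, $\mathcal{P}(\mathcal{D}_n \mid r(\cdot, \mathbf{a}))$ has the
form:

$$\begin{aligned}
\mathcal{P}(\mathcal{D}_n \mid r(\cdot, \mathbf{a})) &\propto \exp\left(-\frac{1}{2\sigma_s^2} \sum_{i=1}^{n} \left(y_i - r(\mathbf{x}_i, \mathbf{a})\right)^2\right) \\
&= \exp\left(-\frac{1}{2\sigma_s^2} \sum_{i=1}^{n} \left(y_i - \sum_{t=1}^{K} a_t \frac{\exp\left(-\frac{1}{2}(\mathbf{x}_i - \mathbf{c}_t)^T S_t^{-1} (\mathbf{x}_i - \mathbf{c}_t)\right)}{(2\pi)^{d/2} |S_t|^{1/2}}\right)^2\right) \\
&= \exp\left(-\frac{1}{2\sigma_s^2} \sum_{i=1}^{n} \left(y_i - \sum_{t=1}^{K} a_t G_t(\mathbf{x}_i)\right)^2\right)
\end{aligned}$$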



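Given a set of $n$ data points $\mathcal{D}_n$, one can obtain a maximum a-posteriori (MAP) solution
to the learning problem by finding a set of model parameters $\hat{\mathbf{a}}$ that maximizes $\mathcal{P}(r(\cdot, \mathbf{a}) \mid \mathcal{D}_n) \propto
\mathcal{P}_{\mathcal{F}}(r(\cdot, \mathbf{a}))\, \mathcal{P}(\mathcal{D}_n \mid r(\cdot, \mathbf{a}))$. Let:

$$\mathbf{z}_i = [G_1(\mathbf{x}_i)\ G_2(\mathbf{x}_i)\ \ldots\ G_K(\mathbf{x}_i)]^T$$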

be a vector of RBF kernel output values for the $i$-th input value. One can show (see Appendix A.2.1),
as in the polynomial case, that the a-posteriori RBF approximation function class distribution
$\mathcal{P}(r(\cdot, \mathbf{a}) \mid \mathcal{D}_n)$ is a multivariate Gaussian centered at $\hat{\mathbf{a}}$ with covariance $\Sigma_n$, where:

$$\Sigma_n^{-1} = \Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{i=1}^{n} \mathbf{z}_i \mathbf{z}_i^T \quad (15)$$

$$\hat{\mathbf{a}} = \Sigma_n \left( \frac{1}{\sigma_s^2} \sum_{i=1}^{n} \mathbf{z}_i y_i \right) \quad (16)$$

Notice that $\hat{\mathbf{a}}$ of Equation 16 is also the MAP solution the learner proposes on the basis of the
data set $\mathcal{D}_n$, regardless of how the data points are selected. We now describe an active strategy for
selecting the data optimally.

4.3.1 The Active Strategy 

Recall that our goal is to derive an analytical expression for $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \mathbf{x}_{n+1})$ in Equation 7,
the total output uncertainty cost function whose minimization yields the optimal location for sampling
next. As before, we go through the general active learning procedure outlined in Section 3.4 to
derive an exact expression for $\hat{\mathbf{x}}_{n+1}$, the optimal next query point.

The first derivation step is to obtain an analytical expression for $\mathcal{P}(\mathbf{a} \mid \mathcal{D}_n)$, the a-posteriori RBF
approximation function class distribution. This is exactly $\mathcal{P}(r(\cdot, \mathbf{a}) \mid \mathcal{D}_n)$, which we introduced in the
series of equations above leading to Equation 16.

Deriving the RBF total output uncertainty cost function $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \mathbf{x}_{n+1})$ requires several steps
(see Appendix A.2.2 and A.2.3). We eventually get:

$$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \mathbf{x}_{n+1}) \propto |\Sigma_{n+1}|. \quad (17)$$

$\Sigma_{n+1}$ has the same form as $\Sigma_n$ in Equation 15 and depends on the previous data sample locations
$\{\mathbf{x}_i : i = 1, \ldots, n\}$, the model priors $\Sigma_{\mathcal{F}}$, the data noise variance $\sigma_s^2$, and the next sample location
$\mathbf{x}_{n+1}$. When minimized over $\mathbf{x}_{n+1}$, we get $\hat{\mathbf{x}}_{n+1}$ as the maximum utility location where the active
learner should next sample the unknown target function.

Like the polynomial class example, our RBF approximation function class $\mathcal{F}$ is also linear in its
model parameters. As such, the optimal new sample location $\hat{\mathbf{x}}_{n+1}$ does not depend on the $y$ data
values in $\mathcal{D}_n$, but only on the previously sampled $\mathbf{x}$ values.
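
For a one-dimensional input, the selection rule implied by Equations 15 and 17 can be sketched as follows. This is an illustration, not the paper's implementation: the exact minimization is replaced by a search over a finite candidate grid, the kernel "covariance" is a scalar variance `width`, and the names `rbf_features` and `next_rbf_query` are hypothetical.

```python
import numpy as np

def rbf_features(x, centers, width):
    """z(x) = [G_1(x), ..., G_K(x)] for 1-D Gaussian kernels with variance `width`."""
    d = x - np.asarray(centers, dtype=float)
    return np.exp(-0.5 * d * d / width) / np.sqrt(2.0 * np.pi * width)

def next_rbf_query(xs, centers, width, prior_cov, noise_var, candidates):
    """Candidate x minimizing |Sigma_{n+1}|; only previous inputs xs are needed."""
    precision = np.linalg.inv(prior_cov)
    for x in xs:
        z = rbf_features(x, centers, width)
        precision = precision + np.outer(z, z) / noise_var   # form of Equation 15
    scores = []
    for x in candidates:
        z = rbf_features(x, centers, width)
        # log|Sigma_{n+1}^{-1}|: larger means smaller |Sigma_{n+1}|
        scores.append(np.linalg.slogdet(precision + np.outer(z, z) / noise_var)[1])
    return float(candidates[int(np.argmax(scores))])

if __name__ == "__main__":
    grid = np.linspace(-5.0, 5.0, 101)                 # assumed candidate grid
    centers = np.linspace(-4.0, 4.0, 8)                # 8 illustrative centers
    xs = []
    for _ in range(5):
        xs.append(next_rbf_query(xs, centers, 1.0, np.eye(8), 0.01, grid))
    print(xs)
```

The function takes no $y$ values at all, which mirrors the observation above that the chosen location depends only on where the learner has already sampled.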

4.3.2 Simulations — Error Rate versus Number of Examples 

Does the active strategy for our RBF approximation function class take fewer examples to 
learn an unknown target than a passive learner that draws random samples according to a uniform 
distribution on the input domain? We compare sample complexities for the active and passive 
learners under the following two conditions: 

1. The approximation and target function classes have identical priors. For simplicity,
we perform our simulations in a one-dimensional input domain $[x_{LO}, x_{HI}] = [-5, 5]$. The
approximation and target function classes are RBF networks with $K = 8$ fixed centers,
arbitrarily located within the input domain. Each RBF kernel has a fixed 1-dimensional
Gaussian "covariance" of $S_i = 1.0$. Finally, we assume identical independent Gaussian priors
on the model parameters $\mathbf{a}$, i.e. $\Sigma_{\mathcal{F}} = I_K = I_8$, where $I_K$ stands for a $K \times K$ identity
covariance matrix.

2. The approximation and target function classes have slightly different priors. We
use a similar RBF approximation function class with $K = 8$ fixed centers and similar
1-dimensional Gaussian kernel "covariances" of $S_i = 1.0$ for the centers. Each center is slightly
displaced from its true location (i.e. its location in the target function class) by a random
distance with Gaussian standard deviation $\sigma = 0.1$. The learner's priors on model parameters
($\Sigma_{\mathcal{F}} = 0.9 I_8$) are also slightly different from those of the target class ($\Sigma_{\mathcal{F}} = I_8$).

The two simulations proceed as follows: We randomly generate 5000 target RBF functions
according to the target model priors described above. For each target function, we collect data
sets with noisy $y$ values ($\sigma_s = 0.1$) ranging from 3 to 50 samples in size, using both the active
and passive sampling strategies. We then obtain a regularized estimate of the target function for
each data set using Equation 16, and finally, we compute the actual integrated squared difference
between the target and its estimate as an approximation error measure. The graphs in Figure 9
plot the average error rates (over the 5000 different target functions) for both the active and passive
learners as a function of data sample size. The upper graph shows the learner using exact priors,
i.e., those of the target class, while the lower graph is for the case of slightly incorrect priors. In
both cases, the active learner has a lower average error rate than the passive learner for the same
number of data points. This is especially true for small data sets.

5 Sufficiency Conditions for Pre-Computing a Data Sampling Sequence 

It is noteworthy that both for learning RBF weights as well as polynomial coefficients, the new
optimal sample location, $\hat{x}_{n+1}$, does not depend on the $y_i$ data values previously observed but only
on the $x_i$ values sampled. Thus, if the learner were to collect $n$ data points, it can pre-compute
the exact sequence of $n$ points at which to sample from the start, even before receiving any data
from the target function. This behavior has been observed by MacKay [10] for an active example
selection strategy that minimizes only a model parameter variance cost function. For such cost
functions, any class of approximation functions that are linear in their model parameters would
exhibit such behavior.

In our framework, we minimize an output uncertainty cost function that includes both bias and 
variance terms. The following theorem provides sufficiency conditions on the learning problem for 
which our active learning formulation leads to a data selection strategy that does not depend on 
previously observed yi data values. 

Theorem 2 Suppose $\mathcal{F}$ is a class of real-valued functions parameterized by $\mathbf{a} \in \Re^k$. On the basis
of a data set $\mathcal{D}_n = \{(x_i, y_i) : i = 1, \ldots, n\}$, let the MAP solution to the learning problem be given
by $\hat{\mathbf{a}} = \arg\max_{\mathbf{a} \in \Re^k} \mathcal{P}(g(\mathbf{a}) \mid \mathcal{D}_n)$. Then the following three conditions guarantee that the choice of
$\hat{x}_{n+1}$ will be independent of the previously observed $y_i$'s in $\mathcal{D}_n$.

1. $\mathcal{P}(g(\mathbf{a}) \mid \mathcal{D}_n)$ can be expressed as $Q((\mathbf{a} - \hat{\mathbf{a}}(\mathcal{D}_n)), \{x_i : i = 1 \ldots n\})$ where $Q$ is some arbitrary
function that does not otherwise depend on the data $\mathcal{D}_n$. In other words, the $y_i$ terms of $\mathcal{D}_n$ do not
appear anywhere else in $\mathcal{P}(g(\mathbf{a}) \mid \mathcal{D}_n) = Q((\mathbf{a} - \hat{\mathbf{a}}(\mathcal{D}_n)), \{x_i : i = 1 \ldots n\})$ other than in $\hat{\mathbf{a}}$.

2. $\mathcal{F}$ is linear in its parameters $\mathbf{a}$, i.e.: $g_{\mathbf{a}_1 + \mathbf{a}_2}(x) = g_{\mathbf{a}_1}(x) + g_{\mathbf{a}_2}(x)$.

3. The prior distribution on model parameters $\mathbf{a}$ must have support equal to $\Re^k$.

As it turns out, if the conditions of the above theorem are met, the output uncertainty function 
U of our active learning formulation can also be analytically solved. Note that these conditions 
are sufficient to guarantee analytical tractability. They are by no means necessary and it would be 
useful to shed stronger light on the kinds of function classes and probability distributions for which 
exact active algorithms exist. 
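
As a hedged illustration of the first condition, consider a model that is linear in its parameters with a zero-mean Gaussian prior and Gaussian noise, as in Sections 4.2 and 4.3. Completing the square in the exponent (a standard identity, written here with a generic feature vector $\mathbf{z}_i$ standing for either the power vector or the RBF kernel vector) gives:

```latex
\begin{aligned}
\mathcal{P}(\mathbf{a} \mid \mathcal{D}_n)
  &\propto \exp\!\Big(-\tfrac{1}{2}\,\mathbf{a}^T \Sigma_{\mathcal{F}}^{-1} \mathbf{a}\Big)\,
           \exp\!\Big(-\tfrac{1}{2\sigma_s^2} \sum_{i=1}^{n} \big(y_i - \mathbf{z}_i^T \mathbf{a}\big)^2\Big) \\
  &\propto \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{a} - \hat{\mathbf{a}})^T \Sigma_n^{-1} (\mathbf{a} - \hat{\mathbf{a}})\Big),
\qquad
\Sigma_n^{-1} = \Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{i=1}^{n} \mathbf{z}_i \mathbf{z}_i^T,
\qquad
\hat{\mathbf{a}} = \Sigma_n \Big( \frac{1}{\sigma_s^2} \sum_{i=1}^{n} \mathbf{z}_i y_i \Big).
\end{aligned}
```

The $y_i$ values enter only through $\hat{\mathbf{a}}$, while $\Sigma_n$ depends on the inputs alone, so the posterior has the form $Q((\mathbf{a} - \hat{\mathbf{a}}), \{x_i\})$ required by the first condition.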

6 Active Example Selection and the "Boot-strap" Paradigm 

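Recall from Section 3.3 that our active learning strategy chooses its next sample location by
minimizing the total output uncertainty cost function in Equation 7. For convenience, we reproduce
the relevant expressions below:

$$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1}) = \int_{-\infty}^{\infty} \mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)\, E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]\, dy_{n+1}. \quad (18)$$

where:

$$E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] = \int_{g \in \mathcal{F}} \mathcal{P}(g \mid \mathcal{D}_n)\, \delta(\hat{g}, g)\, dg \propto \int_{g \in \mathcal{F}} \mathcal{P}_{\mathcal{F}}(g)\, \mathcal{P}(\mathcal{D}_n \mid g)\, \delta(\hat{g}, g)\, dg.$$

and:

$$\mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n) \propto \int_{f \in \mathcal{F}} \mathcal{P}(\mathcal{D}_n \cup (x_{n+1}, y_{n+1}) \mid f)\, \mathcal{P}_{\mathcal{F}}(f)\, df.$$

Clearly, from the three equations above, the cost function $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ may not have a simple
analytical form for many approximation function classes. The current active learning formulation
may therefore be computationally intractable for arbitrary approximation function classes in
general.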

One way of making the active learning task computationally tractable is to define simpler but 
less "complete" cost functions for measuring the potential utility of new sample locations. To 
conclude this paper, we look at one such simplification approach and show how it leads to the 
"boot-strap" example selection strategy we used for training object and pattern detection systems. 
We also show empirically that even though the "boot-strap" strategy may be "sub-optimal" in its 
choice of new examples, it still outperforms random sampling in training a frontal face detection 
system. As such, we maintain that the "boot-strap" strategy is still an effective means of sifting 
through unmanageably large data sets that would otherwise make learning intractable. 

6.1 A Simpler Example Utility Measure 

Let $g$ be an unknown target function that we want to estimate by means of an approximation
function in $\mathcal{F}$, $\mathcal{D}_n = \{(x_i, y_i) \in \Re^d \times \Re \mid i = 1, \ldots, n\}$ be the set of $n$ data points seen so
far, and $\hat{g}_n \in \mathcal{F}$ be the current regularized estimate for $g$. Recall from Section 3.3 that our
learning goal is to minimize an expected misfit notion between $g$ and its regularized solution, and
our optimal sampling strategy chooses the next input location $\hat{x}_{n+1} \in \Re^d$ that best minimizes
$E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid \mathcal{D}_n \cup (\hat{x}_{n+1}, y_{n+1})]$, the EISD between $g$ and its new estimate $\hat{g}_{n+1}$.

We now describe a different example selection heuristic, based on a simpler but less comprehensive
example utility measure, that also attempts to efficiently reduce the expected misfit between $g$
and $\hat{g}_{n+1}$. Let $\mathcal{L}(\hat{g}_n \mid \mathcal{D}_n, x)$ be a local "uncertainty" measure for the current estimate $\hat{g}_n$ at $x$:

$$\mathcal{L}(\hat{g}_n \mid \mathcal{D}_n, x) = \int_{g \in \mathcal{F}} \mathcal{P}(g \mid \mathcal{D}_n)\, (\hat{g}_n(x) - g(x))^2\, dg \quad (19)$$

Notice that unlike the EISD, which globally characterizes an entire approximation function, $\mathcal{L}(\hat{g}_n \mid \mathcal{D}_n, x)$
is just a local "error bar" measure between $g$ and $\hat{g}_n$ only at $x$. In a loose sense, one can view the
local "error bar" $\mathcal{L}(\hat{g}_n \mid \mathcal{D}_n, x)$ as information that the learner lacks at input location $x$ for an exact
solution. Given such an interpretation, one can also regard $\mathcal{L}(\hat{g}_n \mid \mathcal{D}_n, x)$ as an example utility
measure, because the learner essentially gains the information it lacks at $x$ by sampling there. The
new example selection heuristic can thus be formulated as choosing the next sample where the new
example's utility value is greatest, i.e., sampling next where the learner lacks most information:

$$\hat{x}_{n+1} = \arg\max_{x \in \Re^d} \mathcal{L}(\hat{g}_n \mid \mathcal{D}_n, x) = \arg\max_{x \in \Re^d} \int_{g \in \mathcal{F}} \mathcal{P}(g \mid \mathcal{D}_n)\, (\hat{g}_n(x) - g(x))^2\, dg \quad (20)$$

We stress again that the new example selection heuristic differs from our original active learning
formulation only in the example utility cost function it uses. The new heuristic uses a simpler 
example utility measure, based on the current estimate's local uncertainty properties instead of 
the new estimate's expected global uncertainty properties. In doing so, it implicitly assumes the 
following: 

1. The learning goal is still to minimize the expected misfit, i.e., the total output uncertainty
between $g$ and its estimate $\hat{g}$.

2. One can bring about a proportionate decrease in the new estimate's global output uncertainty 
level by locally reducing the current estimate's output uncertainty at some arbitrary location. 

3. There is a better chance of significantly reducing the local output uncertainty at a point whose 
current uncertainty level is high. Furthermore, one can best reduce the local uncertainty level 
at a point by sampling directly at the point. 

4. It follows from the previous assumptions that one can most efficiently minimize the new 
estimate's global uncertainty level by gathering new data where the current estimate's local 
output uncertainty level is highest. 

Although the above assumptions appear intuitively reasonable, one should still be able to find
approximation function classes that do not meet the above assumptions. This suggests that in
general, the new heuristic may still be a "sub-optimal" sampling strategy with respect to the active
learning goal of maximally reducing the expected misfit between $g$ and $\hat{g}$ with each new data point.
Nevertheless, Equation 19 is clearly a much simpler example utility cost function than Equation 18,
which makes the new heuristic computationally tractable for a much larger range of approximation
function classes.

Are there approximation function classes for which the new heuristic and the original active
example selection strategy are functionally equivalent? MacKay [10] has shown that if one
approximates the current a-posteriori model parameter distribution, i.e. $\mathcal{P}(\mathbf{a} \mid \mathcal{D}_n) = \mathcal{P}(g(\cdot, \mathbf{a}) \mid \mathcal{D}_n)$,
as a multi-dimensional Gaussian probability density centered at $\hat{\mathbf{a}}$, the optimal model parameter
estimate, then minimizing the new estimate's global uncertainty level reduces to sampling where the
current estimate's "error bars" are greatest. MacKay has also observed that for linear approximation
function classes (i.e., ones for which $g(x, \mathbf{a}) = \sum_{i} a_i \phi_i(x)$) with quadratic penalty functions,
$\mathcal{P}(\mathbf{a} \mid \mathcal{D}_n)$ is exactly a multi-dimensional Gaussian probability density centered at $\hat{\mathbf{a}}$, which in turn
suggests that the two sampling strategies are computationally equivalent for such approximation
function classes. We refer the interested reader to MacKay's work [10] for further details.
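
The equivalence described above can be illustrated for a polynomial model. The sketch below rests on two assumptions that should be read as such: the posterior over parameters is Gaussian with covariance `cov`, so the local error bar of Equation 19 reduces to $\mathbf{z}(x)^T \Sigma_n \mathbf{z}(x)$, and the candidate locations form a finite grid. The function names are hypothetical.

```python
import numpy as np

def features(x, K):
    """Basis vector [1, x, ..., x^K] of a model that is linear in its parameters."""
    return np.array([x ** k for k in range(K + 1)], dtype=float)

def error_bar(x, cov):
    """Local uncertainty z(x)^T Sigma_n z(x) under a Gaussian posterior with covariance cov."""
    z = features(x, cov.shape[0] - 1)
    return float(z @ cov @ z)

def pick_by_error_bar(cov, candidates):
    """Heuristic of Equation 20: sample where the current error bar is largest."""
    return max(candidates, key=lambda x: error_bar(x, cov))

def pick_by_determinant(cov, noise_var, candidates):
    """Global criterion: sample where the new posterior covariance determinant is smallest."""
    precision = np.linalg.inv(cov)

    def new_logdet_cov(x):
        z = features(x, cov.shape[0] - 1)
        return -np.linalg.slogdet(precision + np.outer(z, z) / noise_var)[1]

    return min(candidates, key=new_logdet_cov)

if __name__ == "__main__":
    # Posterior precision after sampling at x = 0 and x = 1 with an identity prior
    # and noise variance 0.5 (illustrative numbers only).
    precision = np.array([[5.0, 2.0, 2.0], [2.0, 3.0, 2.0], [2.0, 2.0, 3.0]])
    cov = np.linalg.inv(precision)
    grid = [float(c) for c in np.linspace(-2.0, 2.0, 41)]
    print(pick_by_error_bar(cov, grid), pick_by_determinant(cov, 0.5, grid))
```

The two rules agree here because, by the matrix determinant lemma, $|\Sigma_{n+1}^{-1}| = |\Sigma_n^{-1}|\,(1 + \mathbf{z}^T \Sigma_n \mathbf{z} / \sigma_s^2)$, which is monotone in the error bar.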

6.2 Example Selection in a Real Pattern Detection Training Scenario 

We now discuss a variant of the simplified example selection heuristic, used for training a frontal 
view human face detection system. In order to train a face detection system with finite computation 
resource, one must first acquire a comprehensive but tractably small database of "face" and "non- 
face" example patterns. For "face" patterns, one can simply collect all frontal face views from 
mugshot databases and other image archives, and still have a manageably small data set. For 
"non-face" patterns, the task is more tricky. In essence, any normalized window pattern that does 
not tightly contain a frontal human face is a valid "non-face" training example. Clearly, our "non- 
face" example set can grow intractably large if we should include every available "non-face" image 
patch in our training database. 

Notice that our learning scenario for face detection differs slightly from the original active
learning scenario presented earlier. In the original setting, one assumes that data measurements
are relatively expensive or slow, and we seek the next sample location that best maximizes the
expected amount of information gained. In our current scenario, we have an immense amount of
available "non-face" data from which we wish to select a small training sample most useful for our
learning task. The learner in our face detection scenario also has an added advantage: it already
knows the actual output value (i.e., class label) of every candidate "non-face" data point even
before it is selected.

Clearly, the current learning scenario reduces to the original active learning setting if we ignore 
output values (i.e., class labels) of the candidate data points when deciding which new patterns to 
select. Despite apparent differences in form, both learning scenarios address the same central issue, 
namely how a learner can select new examples intelligently by estimating the utility of candidate 
data points. In fact, we shall see shortly that one can still borrow ideas developed for the original 
active learning scenario to approach example selection in the face detection scenario. 

6.3 The "Boot-strap" Paradigm 

To constrain the number of "non-face" patterns in our training database, we introduce a "boot- 
strap" paradigm that incrementally selects "non-face" patterns highly relevant to the learning 
problem. The "boot-strap" strategy reduces the number of "non-face" patterns needed to train a 
highly robust face detector. We outline the idea below: 

1. Start with a small and possibly highly non-representative set of "non-face" examples in the 
training database. 

2. Train a face classifier to output a value of '1' for face examples and '0' for non-face examples 
using patterns from the current example database. 

3. Run the trained face detector on a sequence of images with no faces. Collect all (or a random 
subset of) the "non-face" patterns that the current system wrongly classifies as "faces" (i.e., 
an output value of > 0.5). Add these "non-face" patterns to the training database as new 
negative examples. 

4. Return to Step 2. 

More generally, one can use the "boot-strap" paradigm to select useful training examples from 
either pattern class in an arbitrary pattern detection problem: 

1. Start with a small and possibly highly non-representative example set in the training database. 

2. Train a pattern classifier to output a value of '1' for positive examples and '0' for negative 
examples using patterns from the current example database. 

3. Run the trained pattern classifier on a sequence of images. Collect all (or a random subset 
of) the wrongly classified patterns and add them to the training database as new correctly 
labeled examples. 

4. Return to Step 2. 
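
The steps above can be sketched in a few lines. This is a minimal sketch, assuming a toy two-dimensional data set and a small logistic-regression classifier as a stand-in for the actual face classifier; the function names, the synthetic data and the parameters `n_initial`, `max_rounds` and `max_new` are all hypothetical choices made for illustration.

```python
import numpy as np

def train_classifier(X, y, steps=500, lr=0.5):
    """Fit a tiny logistic-regression classifier (stand-in for the real one)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)

    def predict(Z):
        Zb = np.hstack([Z, np.ones((len(Z), 1))])
        return 1.0 / (1.0 + np.exp(-Zb @ w))           # output in [0, 1]
    return predict

def bootstrap(positives, negative_pool, n_initial=5, max_rounds=10, max_new=50, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: a small, possibly non-representative set of negatives.
    chosen = set(rng.choice(len(negative_pool), size=n_initial, replace=False).tolist())
    for _ in range(max_rounds):
        negatives = negative_pool[sorted(chosen)]
        X = np.vstack([positives, negatives])
        y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
        predict = train_classifier(X, y)               # Step 2: train on current database
        scores = predict(negative_pool)                # Step 3: run on negative-only data
        wrong = [i for i in np.flatnonzero(scores > 0.5) if i not in chosen]
        if not wrong:
            break                                      # no new false positives to add
        picked = rng.choice(wrong, size=min(max_new, len(wrong)), replace=False)
        chosen.update(picked.tolist())                 # Step 4: retrain with the additions
    return predict, sorted(chosen)

rng = np.random.default_rng(1)
faces = rng.normal([2.0, 2.0], 0.5, size=(100, 2))         # toy "face" patterns
non_faces = rng.normal([-1.0, -1.0], 1.5, size=(2000, 2))  # large pool of "non-face" patterns
predict, chosen = bootstrap(faces, non_faces)
```

Only the negatives that the current classifier gets wrong are ever added, so the training database stays much smaller than the pool it is drawn from.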

At the end of each iteration, the "boot-strap" paradigm augments the current data set with new 
patterns that the current system classifies wrongly. We argue that this strategy of collecting 
wrongly classified patterns as new training examples is reasonable, because one can expect these 
new examples to improve the classifier's performance by steering it away from its current mistakes. 

One can reason about the "boot-strap" paradigm as a variant of the previously discussed simplified 
example selection heuristic. First, as in the original active learning spirit, "boot-strapping" 
attempts to select only high utility examples for training. During each example selection step, if 
"boot-strapping" were to follow the active learning procedure exactly, then it should select only the 
highest utility example available and continue the training process. Instead, we make the "boot-strap" 
paradigm less restrictive by allowing the learner to select one or more "high utility" examples 
between training cycles. Notice that the highest utility example may not even be among those selected. 
Second, the simplified heuristic estimates an example's utility by computing $\mathcal{C}(\hat{g}_n \mid \mathcal{D}_n, x)$, 
the local "error bar" measure of Equation 19. In "boot-strapping", we take advantage of the already 
available output value (i.e., class label) at each candidate location to implement a coarse but very 
simple local "error bar" measure for selecting new examples. Points that are classified correctly 
have low actual output errors. We assume that these points also have low uncertainty "error bars" 
and so we ignore them as examples containing little new information. Conversely, points that are 
wrongly classified have high actual output errors. We conveniently assume that these points also 
have high local uncertainty "error bars" and are therefore "high utility" examples suitable for the 
learning task. We stress again that we are assuming actual output errors and local uncertainty 
"error bars" are highly correlated measurements, which may not always be a valid assumption. 

6.4 Sample Complexity of "Boot-strapping" 

Does the "boot-strap" strategy yield better classification results with fewer training examples 
than a passive learner that receives randomly drawn patterns? We show empirically that this is 
indeed the case for our frontal face detection learning scenario. Using a Radial Basis Function-like 
network architecture for pattern classification, we trained a frontal view face detection system (see 
[20] for details). This was done in two "boot-strap" cycles. The final training database contains 
4150 "face" patterns and over 40000 "non-face" patterns, of which about 6000 were selected during 
the second "boot-strap" cycle. As a control, we trained a second system without "boot-strapping". 
The second training database contains the same 4150 "face" patterns and another 40000 randomly 
selected "non-face" patterns. We used the same "face" and "non-face" Gaussian clusters from the 
first system to model the target and near-miss pattern distributions in the second system. 

We ran both systems on a test database of 23 cluttered images with 147 frontal faces. The first 
system missed 23 faces and produced 13 false detects, while the second system had 15 missed faces 
and 44 false alarms. Notice that the first system trained with "boot-strapping" has a lower total 
mis-classification rate than the control trained without "boot-strapping". The "boot-strap" system 
misses more faces but has a much smaller false alarm rate for non-faces. This is because we have 
used "boot-strapping" to only select better "non-face" patterns while leaving the total number of 
"face" patterns unchanged. Consequently, the system is better able to reject non-face patterns at 
a slight expense of correctly detecting faces. 
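
The error counts quoted above can be tallied directly. The snippet below is only a bookkeeping aid; the variable names are ours, and the counts are those reported in the text.

```python
n_faces = 147  # frontal faces in the 23 test images
# (missed faces, false detects) as reported in the text
results = {"with boot-strapping": (23, 13), "without boot-strapping": (15, 44)}

totals = {name: missed + false for name, (missed, false) in results.items()}
miss_rates = {name: missed / n_faces for name, (missed, _) in results.items()}

for name in results:
    print(f"{name}: {totals[name]} total errors, miss rate {miss_rates[name]:.1%}")
```

This gives 36 total errors for the "boot-strap" system against 59 for the control, with miss rates of roughly 15.6% and 10.2% respectively.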

Our comparison suggests that even though the "boot-strap" strategy may be "sub-optimal" in 
choosing new training patterns, it is still a very simple and effective technique for sifting through 
unmanageably large data sets for useful examples. 

7 Conclusion 

We have developed a framework for choosing examples usefully and investigated its plausibility 
as a reasonable paradigm for active learning. This framework rests on techniques from Optimal 
Experiment Design, generalizes the work of MacKay in a Bayesian setting and is similar to the 
work of Sollich in a non-Bayesian one. We then show how precise algorithms can be derived for 
certain function classes and explore the improved performance of these algorithms. 

There are a number of useful directions to explore further. First, it is useful to understand 
what sort of function classes would yield tractable solutions to the objective function criteria set 
up in our framework. We have attempted a partial answer to this, but the analysis is by no means 
complete. Another useful direction is to consider forms of activity other than example selection 
that might be useful in practice. We have made no attempt to discuss this issue in the current 
paper. 

[Figure 10 panels: System trained without Bootstrapping; System trained with Bootstrapping] 

Figure 10: Comparing face detection results for two systems: one trained without "boot-strapping" and the 
other with "boot-strapping". The system trained without "boot-strapping" does a poorer job at discarding 
non-face patterns. Top pair: The system without "boot-strapping" finds the frontal face correctly but also makes 
four false detects: three near the top-left image corner and one near Mia Hamm's left knee. Bottom pair: 
The system without "boot-strapping" makes two false detects in the complex background, left of Iolaus' face. 

There have been some previous attempts to develop theoretical frameworks and analyses of 
active learning situations. It is crucial that these frameworks eventually translate into some sort of 
working algorithms that can be used to solve complex learning problems. As an attempt to bridge 
the gap between theory and practice, we have argued how the bootstrap paradigm can be derived 
from our framework and implemented with some success in a sophisticated face detection system. 

A The Active Learning Procedure 

This appendix derives the active example selection procedures for polynomial approximators 
and Gaussian radial basis functions. Specifically, we show the steps leading to Equations 11 and 17, 
i.e., the total output uncertainty cost functions $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ for polynomial approximators and 
Gaussian RBFs respectively. 

A.1 Polynomial Approximators 

Let $\mathcal{F}$ be a univariate polynomial approximation function concept class with maximum degree 
$K$: 

$$\mathcal{F} = \Big\{ g(x, a) = g(x, a_0, \ldots, a_K) = \sum_{i=0}^{K} a_i x^i \Big\}.$$

The model parameters to be learnt are $a = [a_0\ a_1\ \cdots\ a_K]^T$ and $x \in [x_{LO}, x_{HI}]$ is the input variable. 
The prior distribution on $\mathcal{F}$ is a zero-mean Gaussian distribution with covariance $\Sigma_{\mathcal{F}}$ on the model 
parameters $a$: 

$$\mathcal{P}_{\mathcal{F}}(g(\cdot, a)) = \mathcal{P}_{\mathcal{F}}(a) = \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_{\mathcal{F}}|^{1/2}} \exp\Big(-\frac{1}{2} a^T \Sigma_{\mathcal{F}}^{-1} a\Big). \qquad (21)$$

Our task is to approximate an unknown target function $g \in \mathcal{F}$ within the input range $[x_{LO}, x_{HI}]$ on 
the basis of noisy sampled data: $\mathcal{D}_n = \{(x_i, y_i = g(x_i) + \eta) \mid i = 1, \ldots, n\}$, where $\eta$ is an additive 
zero-mean Gaussian noise term with variance $\sigma_s^2$. 

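A.1.1 The A-Posteriori Function Class Distribution 

We first derive the a-posteriori distribution on function class $\mathcal{F}$ given data $\mathcal{D}_n$, i.e., 
$\mathcal{P}(a \mid \mathcal{D}_n) \propto \mathcal{P}_{\mathcal{F}}(a)\, \mathcal{P}(\mathcal{D}_n \mid a)$. Since $\mathcal{D}_n$ is sampled under additive zero-mean Gaussian noise 
with variance $\sigma_s^2$, we have: 

$$\begin{aligned}
\mathcal{P}(\mathcal{D}_n \mid a) &\propto \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \big(y_j - g(x_j, a)\big)^2\Big) \\
&= \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=0}^{K} a_t x_j^t\Big)^2\Big) \qquad (22)
\end{aligned}$$

Let $\bar{x}_j = [1\ x_j\ x_j^2\ \cdots\ x_j^K]^T$ be a power vector of the $j$th input value. One can expand the exponent 
term in Equation 22 as follows: 

$$\begin{aligned}
\Big(y_j - \sum_{t=0}^{K} a_t x_j^t\Big)^2 &= y_j^2 + \Big(\sum_{t=0}^{K} a_t x_j^t\Big)^2 - 2 y_j \sum_{t=0}^{K} a_t x_j^t \\
&= y_j^2 + a^T (\bar{x}_j \bar{x}_j^T)\, a - y_j \bar{x}_j^T a - a^T \bar{x}_j y_j
\end{aligned}$$

So: 

$$-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=0}^{K} a_t x_j^t\Big)^2 = -\frac{1}{2}\Big( \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 + a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{x}_j \bar{x}_j^T\Big) a - \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j \bar{x}_j^T\Big) a - a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{x}_j y_j\Big) \Big)$$

The polynomial prior distribution $\mathcal{P}_{\mathcal{F}}(a)$ is given in Equation 21. The a-posteriori distribution 
is thus: 

$$\begin{aligned}
\mathcal{P}(a \mid \mathcal{D}_n) &\propto \mathcal{P}_{\mathcal{F}}(a)\, \mathcal{P}(\mathcal{D}_n \mid a) \\
&\propto \exp\Big(-\frac{1}{2} a^T \Sigma_{\mathcal{F}}^{-1} a\Big) \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=0}^{K} a_t x_j^t\Big)^2\Big) \\
&= \exp\Big(-\frac{1}{2}\Big( \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 + a^T \Big(\Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{x}_j \bar{x}_j^T\Big) a - \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j \bar{x}_j^T\Big) a - a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{x}_j y_j\Big) \Big)\Big) \qquad (23)
\end{aligned}$$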



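Completing the square in Equation 23 yields: 

$$\mathcal{P}(a \mid \mathcal{D}_n) \propto \exp\Big(-\frac{1}{2}\Big( (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a}) - \hat{a}^T \Sigma_n^{-1} \hat{a} + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 \Big)\Big) \qquad (24)$$

where: 

$$\Sigma_n^{-1} = \Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{x}_j \bar{x}_j^T \qquad (25)$$

$$\hat{a} = \Sigma_n \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{x}_j y_j\Big) \qquad (26)$$

Notice that neither of the two terms $\hat{a}^T \Sigma_n^{-1} \hat{a}$ and $\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2$ in Equation 24 depends on the 
polynomial model parameters $a$. This means that we can rewrite Equation 24 as: 

$$\mathcal{P}(a \mid \mathcal{D}_n) \propto \exp\Big(-\frac{1}{2} (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a})\Big) \qquad (27)$$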



Clearly, Equation 27 is multivariate Gaussian in form. To express $\mathcal{P}(a \mid \mathcal{D}_n)$ as a standard probability 
distribution on $a$ that integrates to 1, we simply introduce into Equation 27 the appropriate 
normalizing constants: 

$$\mathcal{P}(a \mid \mathcal{D}_n) = \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_n|^{1/2}} \exp\Big(-\frac{1}{2} (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a})\Big) \qquad (28)$$

Thus, the polynomial a-posteriori function class distribution is a multivariate Gaussian centered 
at $\hat{a}$ (Equation 26) with covariance $\Sigma_n$ (Equation 25). 
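
Equations 25 and 26 translate almost line for line into code. The following is a minimal sketch under the assumptions of this section (Gaussian prior, Gaussian noise); the function names and the example values for the prior covariance and noise variance are illustrative and not taken from the paper.

```python
import numpy as np

def power_vectors(x, K):
    """Rows are the power vectors [1, x_j, x_j**2, ..., x_j**K]."""
    return np.vander(np.asarray(x, dtype=float), K + 1, increasing=True)

def polynomial_posterior(x, y, prior_cov, noise_var):
    """Return the posterior mean (Equation 26) and covariance (Equation 25)."""
    K = prior_cov.shape[0] - 1
    X = power_vectors(x, K)
    precision = np.linalg.inv(prior_cov) + X.T @ X / noise_var   # inverse of Sigma_n
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ np.asarray(y, dtype=float)) / noise_var
    return mean, cov

# Illustrative data from a known quadratic, observed without noise.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**2
mean, cov = polynomial_posterior(x, y, prior_cov=10.0 * np.eye(3), noise_var=1e-4)
```

Note that `cov` is computed from the input locations alone; the observed `y` values enter only through `mean`.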



A.1.2 The Polynomial EISD Measure 

Recall from Section 3.3 that the total output uncertainty cost function $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ is 
simply the expected EISD measure between $g$ and its new estimate $\hat{g}_{n+1}$, if the learner samples next 
at $x_{n+1}$. We now derive an expression for the EISD between $g$ and its current estimate $\hat{g}_n$ given 
$\mathcal{D}_n$. We shall use this result later to derive an expression for $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$. 

The expected integrated squared difference (EISD) between an unknown target $g$ and its estimate 
$\hat{g}$ given $\mathcal{D}_n$ is: 

$$E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] = \int_{g \in \mathcal{F}} \mathcal{P}(g \mid \mathcal{D}_n)\, \delta(\hat{g}, g)\, dg$$

where $\delta(\hat{g}, g)$ is a standard integrated squared difference measure between two functions over the 
input space $\Re^d$ or some appropriate region of interest: 

$$\delta(\hat{g}, g) = \int_{x \in \Re^d} \big(\hat{g}(x) - g(x)\big)^2\, dx.$$



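For our polynomial approximation function class, the optimal estimate for $g$ given $\mathcal{D}_n$ has 
model parameters $\hat{a}$ (Equation 26), since this is where $\mathcal{P}(a \mid \mathcal{D}_n)$ has a global maximum. Letting 
$\hat{a} = [\hat{a}_0\ \hat{a}_1\ \cdots\ \hat{a}_K]^T$ and $\bar{x} = [1\ x\ x^2\ \cdots\ x^K]^T$, one can rewrite $\delta(\hat{g}, g)$ in terms of polynomial model 
parameters as: 

$$\begin{aligned}
\delta(\hat{a}, a) &= \int_{x_{LO}}^{x_{HI}} \big[g(x, a) - g(x, \hat{a})\big]^2\, dx \\
&= \int_{x_{LO}}^{x_{HI}} \Big[\sum_{i=0}^{K} (a_i - \hat{a}_i)\, x^i\Big]^2 dx \\
&= \int_{x_{LO}}^{x_{HI}} (a - \hat{a})^T \bar{x} \bar{x}^T (a - \hat{a})\, dx \\
&= (a - \hat{a})^T \Big(\int_{x_{LO}}^{x_{HI}} \bar{x} \bar{x}^T dx\Big) (a - \hat{a}) \\
&= (a - \hat{a})^T A\, (a - \hat{a}) \qquad (29)
\end{aligned}$$

where $A$ is a constant $(K+1) \times (K+1)$ matrix of numbers whose $(i,j)$th element is: 

$$A(i,j) = \int_{x_{LO}}^{x_{HI}} x^{i+j-2}\, dx$$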



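Substituting Equations 28 and 29 into the EISD expression, we get: 

$$\begin{aligned}
E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] &= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n] \\
&= \int_{a \in \Re^{K+1}} \mathcal{P}(a \mid \mathcal{D}_n)\, \delta(\hat{a}, a)\, da \\
&= \int_{a \in \Re^{K+1}} \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_n|^{1/2}} \exp\Big(-\frac{1}{2} (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a})\Big)\, (a - \hat{a})^T A\, (a - \hat{a})\, da \qquad (30)
\end{aligned}$$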



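Making the following change of variables: $q = A^{1/2}(a - \hat{a})$, and noting that the integration bounds 
on $q$ are still $\Re^{K+1}$, Equation 30 becomes: 

$$\begin{aligned}
E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] &= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n] \\
&= \int_{q \in \Re^{K+1}} \frac{1}{(2\pi)^{(K+1)/2} |A|^{1/4} |\Sigma_n|^{1/2} |A|^{1/4}} \exp\Big(-\frac{1}{2} q^T A^{-1/2} \Sigma_n^{-1} A^{-1/2} q\Big)\, q^T q\, dq \\
&= \int_{q \in \Re^{K+1}} \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_n A|^{1/2}} \exp\Big(-\frac{1}{2} q^T A^{-1/2} \Sigma_n^{-1} A^{-1/2} q\Big)\, q^T q\, dq \\
&= |\Sigma_n A| \propto |\Sigma_n| \qquad (31)
\end{aligned}$$

since $A$ is just a constant matrix of numbers. 

Notice from Equation 25 that $\Sigma_n$ depends only on the polynomial function class priors $\Sigma_{\mathcal{F}}$, the 
output noise variance $\sigma_s^2$ and the previously sampled input locations $\{x_1, x_2, \ldots, x_n\}$. It does not 
depend on the previous $y$ data values actually observed. In other words, the previously observed 
$y$ data values in $\mathcal{D}_n$ do not affect the EISD measure (Equation 31) for this polynomial function 
concept class! 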

A.1.3 The Total Output Uncertainty Measure 

The total output uncertainty cost function resulting from sampling next at $x_{n+1}$ is given by 
Equation 7. We rewrite the expression below in terms of our polynomial model parameters: 

$$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1}) = \int \mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)\, E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]\, dy_{n+1}. \qquad (32)$$

where: 

$$\mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n) \propto \int_{a \in \Re^{K+1}} \mathcal{P}(\mathcal{D}_n \cup (x_{n+1}, y_{n+1}) \mid a)\, \mathcal{P}_{\mathcal{F}}(a)\, da.$$



It is clear from Equation 32 that $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1})$ is merely the expected EISD value between 
$g$ and its new estimate, weighted and averaged over all possible values of $y_{n+1}$ at $x_{n+1}$. Recall from 
Equation 25 however, that for this polynomial function class, the EISD between $g$ and its estimate 
$\hat{g}$ depends only on the input $x_i$ values in $\mathcal{D}_n$ and not on the observed $y_i$ values. This means that 
$E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})]$, the new EISD resulting from sampling next at $x_{n+1}$, does not depend 
on $y_{n+1}$! Equation 32 can therefore be further simplified, which leads to the following closed form 
expression for the total output uncertainty cost function, given also in Equation 11: 

$$\begin{aligned}
\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, x_{n+1}) &= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})] \int \mathcal{P}(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)\, dy_{n+1} \\
&= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (x_{n+1}, y_{n+1})] \\
&= |\Sigma_{n+1} A| \propto |\Sigma_{n+1}| \qquad (33)
\end{aligned}$$

Here, $\Sigma_{n+1}$ has exactly the same form as $\Sigma_n$ in Equation 25, and depends only on the polynomial 
function class priors $\Sigma_{\mathcal{F}}$, the output noise variance $\sigma_s^2$ and the data input locations $\{x_1, x_2, \ldots, x_n, x_{n+1}\}$. 
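
Since the criterion in Equation 33 involves only input locations, the next sample can be chosen before any new output is observed. The following is a minimal sketch that scans a finite grid of candidate locations and keeps the one giving the smallest $|\Sigma_{n+1}|$; the grid search, the function name `next_sample` and the example values are illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np

def next_sample(x_seen, candidates, prior_cov, noise_var):
    """Return the candidate x_{n+1} that minimizes |Sigma_{n+1}| (Equation 33)."""
    K = prior_cov.shape[0] - 1
    X = np.vander(np.asarray(x_seen, dtype=float), K + 1, increasing=True)
    base = np.linalg.inv(prior_cov) + X.T @ X / noise_var   # inverse of Sigma_n
    best_x, best_logdet = None, np.inf
    for x in candidates:
        xbar = np.vander([x], K + 1, increasing=True)[0]    # power vector of the candidate
        _, logdet_precision = np.linalg.slogdet(base + np.outer(xbar, xbar) / noise_var)
        logdet_cov = -logdet_precision                      # log |Sigma_{n+1}|
        if logdet_cov < best_logdet:
            best_x, best_logdet = x, logdet_cov
    return best_x

# Linear model (K = 1) with samples already taken at 0 and 1.
chosen = next_sample([0.0, 1.0], np.linspace(-1.0, 1.0, 21), np.eye(2), noise_var=0.1)
```

By the matrix determinant lemma, $|\Sigma_{n+1}^{-1}| = |\Sigma_n^{-1}|\,(1 + \bar{x}_{n+1}^T \Sigma_n \bar{x}_{n+1} / \sigma_s^2)$, so minimizing $|\Sigma_{n+1}|$ amounts to picking the candidate whose power vector has the largest variance under the current posterior. In the example this is the end of the interval farthest from the existing samples.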



A.1.4 The Polynomial Smoothness Prior 

For our polynomial function class, we have assumed the following multi-dimensional Gaussian 
prior distribution on model parameters $a = [a_0\ a_1\ \cdots\ a_K]^T$: 

$$\mathcal{P}_{\mathcal{F}}(a) = \frac{1}{(2\pi)^{(K+1)/2} |\Sigma_{\mathcal{F}}|^{1/2}} \exp\Big(-\frac{1}{2} a^T \Sigma_{\mathcal{F}}^{-1} a\Big),$$

where $\Sigma_{\mathcal{F}}$ is a $(K+1) \times (K+1)$ covariance matrix. If one assumes an independent Gaussian 
distribution with variance $\sigma_i^2$ on each parameter $a_i$, then $\Sigma_{\mathcal{F}}$ is simply a diagonal matrix with the 
independent variance terms $\{\sigma_i^2 \mid i = 0, \ldots, K\}$ on its principal diagonal. 

Let $p \in \mathcal{F}$ be a polynomial function in the approximation concept class. We consider here a 
slightly different prior distribution on $\mathcal{F}$ based on a global "smoothness" measure: 

$$\mathcal{P}_{\mathcal{F}}(a) \propto \exp\big(-Q(p(\cdot, a))\big) \exp\Big(-\frac{a_0^2}{2\sigma_0^2}\Big) \qquad (34)$$

where the "smoothness" term is: 

$$Q(p(\cdot, a)) = \int_{x_{LO}}^{x_{HI}} \Big[\frac{dp(x, a)}{dx}\Big]^2 dx.$$

We shall show that despite the apparent difference in general form, $\mathcal{P}_{\mathcal{F}}(a)$ still simplifies to a 
multi-dimensional Gaussian distribution, whose new covariance term $\Sigma_{\mathcal{F}}$ is given by Equation 14. First 
we express $Q(p(\cdot, a))$ in terms of the polynomial model parameters $a$: 

Next, we substitute the above result into the exponent of Equation 34. We get: 

where the $K \times K$ matrix $N$ is as defined above. Thus, from Equation 36, we see that the new 
prior distribution $\mathcal{P}_{\mathcal{F}}(a)$ is indeed multi-dimensional Gaussian in form with covariance $\Sigma_{\mathcal{F}}$ as given 
below and in Equation 14: 



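$$\Sigma_{\mathcal{F}}^{-1}(i,j) = \begin{cases} \dfrac{1}{\sigma_0^2} & \text{if } i = j = 1 \\[2mm] \dfrac{2(i-1)(j-1)}{i+j-3} \big(x_{HI}^{\,i+j-3} - x_{LO}^{\,i+j-3}\big) & \text{if } 2 \le i \le K+1 \text{ and } 2 \le j \le K+1 \\[2mm] 0 & \text{otherwise} \end{cases} \qquad (37)$$

A.2 Gaussian Radial Basis Functions 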

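Our next example is a $d$-dimensional Gaussian Radial Basis Function class $\mathcal{F}$ with $K$ fixed centers. 
The analysis here is very similar to the polynomial case presented in the previous section. Let 
$G_i$ be the $i$th basis function with a fixed center $\vec{c}_i$ and a fixed covariance $S_i$. The model parameters 
to be learnt are the RBF weight coefficients denoted by $a = [a_1\ a_2\ \cdots\ a_K]^T$. An arbitrary function 
$r \in \mathcal{F}$ in this class can be represented as: 

$$\begin{aligned}
r(\vec{x}, a) &= \sum_{i=1}^{K} a_i G_i(\vec{x}) \\
&= \sum_{i=1}^{K} a_i \frac{1}{(2\pi)^{d/2} |S_i|^{1/2}} \exp\Big(-\frac{1}{2} (\vec{x} - \vec{c}_i)^T S_i^{-1} (\vec{x} - \vec{c}_i)\Big)
\end{aligned}$$

The prior distribution on $\mathcal{F}$ is a zero-mean Gaussian distribution with covariance $\Sigma_{\mathcal{F}}$ on the model 
parameters: 

$$\mathcal{P}_{\mathcal{F}}(r(\cdot, a)) = \mathcal{P}_{\mathcal{F}}(a) = \frac{1}{(2\pi)^{K/2} |\Sigma_{\mathcal{F}}|^{1/2}} \exp\Big(-\frac{1}{2} a^T \Sigma_{\mathcal{F}}^{-1} a\Big). \qquad (38)$$

Lastly, the learner has access to noisy data of the form $\mathcal{D}_n = \{(\vec{x}_i, y_i = g(\vec{x}_i) + \eta) : i = 1, \ldots, n\}$, 
where $g \in \mathcal{F}$ is an unknown target function and $\eta$ is a zero-mean additive Gaussian noise term with 
variance $\sigma_s^2$. The learning task is to recover $g$ from $\mathcal{D}_n$. 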

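A.2.1 The A-Posteriori Function Class Distribution 

We first derive the a-posteriori distribution on function class $\mathcal{F}$ given data $\mathcal{D}_n$, i.e., 
$\mathcal{P}(a \mid \mathcal{D}_n) \propto \mathcal{P}_{\mathcal{F}}(a)\, \mathcal{P}(\mathcal{D}_n \mid a)$. Since $\mathcal{D}_n$ is sampled under additive zero-mean Gaussian noise 
with variance $\sigma_s^2$, we have: 

$$\begin{aligned}
\mathcal{P}(\mathcal{D}_n \mid a) &\propto \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \big(y_j - r(\vec{x}_j, a)\big)^2\Big) \\
&= \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=1}^{K} a_t \frac{\exp\big(-\frac{1}{2} (\vec{x}_j - \vec{c}_t)^T S_t^{-1} (\vec{x}_j - \vec{c}_t)\big)}{(2\pi)^{d/2} |S_t|^{1/2}}\Big)^2\Big) \\
&= \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=1}^{K} a_t G_t(\vec{x}_j)\Big)^2\Big) \qquad (39)
\end{aligned}$$

Let $\bar{z}_j = [G_1(\vec{x}_j)\ G_2(\vec{x}_j)\ \cdots\ G_K(\vec{x}_j)]^T$ be a vector of kernel output values for the $j$th input value. 
One can expand the exponent term in Equation 39 as follows: 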



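$$\begin{aligned}
\Big(y_j - \sum_{t=1}^{K} a_t G_t(\vec{x}_j)\Big)^2 &= y_j^2 + \Big(\sum_{t=1}^{K} a_t G_t(\vec{x}_j)\Big)^2 - 2 y_j \sum_{t=1}^{K} a_t G_t(\vec{x}_j) \\
&= y_j^2 + a^T (\bar{z}_j \bar{z}_j^T)\, a - y_j \bar{z}_j^T a - a^T \bar{z}_j y_j
\end{aligned}$$

So: 

$$-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=1}^{K} a_t G_t(\vec{x}_j)\Big)^2 = -\frac{1}{2}\Big( \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 + a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j \bar{z}_j^T\Big) a - \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j \bar{z}_j^T\Big) a - a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j y_j\Big) \Big)$$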

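The Gaussian RBF prior distribution $\mathcal{P}_{\mathcal{F}}(a)$ is as given in Equation 38. The a-posteriori 
distribution is thus: 

$$\begin{aligned}
\mathcal{P}(a \mid \mathcal{D}_n) &\propto \mathcal{P}_{\mathcal{F}}(a)\, \mathcal{P}(\mathcal{D}_n \mid a) \\
&\propto \exp\Big(-\frac{1}{2} a^T \Sigma_{\mathcal{F}}^{-1} a\Big) \exp\Big(-\frac{1}{2\sigma_s^2} \sum_{j=1}^{n} \Big(y_j - \sum_{t=1}^{K} a_t G_t(\vec{x}_j)\Big)^2\Big) \\
&= \exp\Big(-\frac{1}{2}\Big( a^T \Sigma_{\mathcal{F}}^{-1} a + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 + a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j \bar{z}_j^T\Big) a - \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j \bar{z}_j^T\Big) a - a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j y_j\Big) \Big)\Big) \\
&= \exp\Big(-\frac{1}{2}\Big( \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 + a^T \Big(\Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j \bar{z}_j^T\Big) a - \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j \bar{z}_j^T\Big) a - a^T \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j y_j\Big) \Big)\Big) \qquad (40)
\end{aligned}$$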



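Completing the square in Equation 40 yields: 

$$\mathcal{P}(a \mid \mathcal{D}_n) \propto \exp\Big(-\frac{1}{2}\Big( (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a}) - \hat{a}^T \Sigma_n^{-1} \hat{a} + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2 \Big)\Big) \qquad (41)$$

where: 

$$\Sigma_n^{-1} = \Sigma_{\mathcal{F}}^{-1} + \frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j \bar{z}_j^T \qquad (42)$$

$$\hat{a} = \Sigma_n \Big(\frac{1}{\sigma_s^2} \sum_{j=1}^{n} \bar{z}_j y_j\Big) \qquad (43)$$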



Notice that, as in the polynomial case (see Appendix A.1.1), neither of the two terms $\hat{a}^T \Sigma_n^{-1} \hat{a}$ 
and $\frac{1}{\sigma_s^2} \sum_{j=1}^{n} y_j^2$ in Equation 41 depends on the RBF model parameters $a$. This means we can simply 
remove the two "constant" terms from the exponent and introduce into Equation 41 the appropriate 
normalizing constants so $\mathcal{P}(a \mid \mathcal{D}_n)$ becomes a standard probability distribution on $a$: 

$$\mathcal{P}(a \mid \mathcal{D}_n) = \frac{1}{(2\pi)^{K/2} |\Sigma_n|^{1/2}} \exp\Big(-\frac{1}{2} (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a})\Big) \qquad (44)$$

Thus, the RBF a-posteriori distribution is a multivariate Gaussian centered at $\hat{a}$ (Equation 43) 
with covariance $\Sigma_n$ (Equation 42). 

A.2.2 The RBF EISD Measure 

We now derive an expression for the expected integrated squared difference (EISD) between an 
unknown RBF target $g$ and its current estimate given $\mathcal{D}_n$. We shall use this result later to derive 
$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \vec{x}_{n+1})$, the total output uncertainty cost function for RBF approximators. 

The expected integrated squared difference (EISD) between an unknown target $g$ and its estimate 
$\hat{g}$ given $\mathcal{D}_n$ is: 

$$E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] = \int_{g \in \mathcal{F}} \mathcal{P}(g \mid \mathcal{D}_n)\, \delta(\hat{g}, g)\, dg$$

where $\delta(\hat{g}, g)$ is a standard integrated squared difference measure between two functions over the 
input space $\Re^d$ or some appropriate region of interest: 

$$\delta(\hat{g}, g) = \int_{\vec{x} \in \Re^d} \big(\hat{g}(\vec{x}) - g(\vec{x})\big)^2\, d\vec{x}.$$



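For our RBF approximation function class, the optimal estimate for $g$ given $\mathcal{D}_n$ has model 
parameters $\hat{a}$ (Equation 43), since this is where the a-posteriori distribution $\mathcal{P}(a \mid \mathcal{D}_n)$ has a global 
maximum. Letting $\hat{a} = [\hat{a}_1\ \hat{a}_2\ \cdots\ \hat{a}_K]^T$ and $\bar{z} = [G_1(\vec{x})\ G_2(\vec{x})\ \cdots\ G_K(\vec{x})]^T$, one can rewrite $\delta(\hat{g}, g)$ in 
terms of RBF model parameters as: 

$$\begin{aligned}
\delta(\hat{a}, a) &= \int_{\vec{x} \in \Re^d} \big[r(\vec{x}, a) - r(\vec{x}, \hat{a})\big]^2\, d\vec{x} \\
&= \int_{\vec{x} \in \Re^d} \Big[\Big(\sum_{i=1}^{K} a_i G_i(\vec{x})\Big) - \Big(\sum_{i=1}^{K} \hat{a}_i G_i(\vec{x})\Big)\Big]^2 d\vec{x} \\
&= \int_{\vec{x} \in \Re^d} \Big[\sum_{i=1}^{K} (a_i - \hat{a}_i)\, G_i(\vec{x})\Big]^2 d\vec{x} \\
&= \int_{\vec{x} \in \Re^d} (a - \hat{a})^T \bar{z} \bar{z}^T (a - \hat{a})\, d\vec{x} \\
&= (a - \hat{a})^T \Big(\int_{\vec{x} \in \Re^d} \bar{z} \bar{z}^T d\vec{x}\Big) (a - \hat{a}) \\
&= (a - \hat{a})^T A\, (a - \hat{a}) \qquad (45)
\end{aligned}$$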



where $A$ is a constant $K \times K$ matrix of numbers whose $(i,j)$th element is: 

$$A(i,j) = \int_{\vec{x} \in \Re^d} G_i(\vec{x})\, G_j(\vec{x})\, d\vec{x}$$
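
When the integration region is all of $\Re^d$, the entries of $A$ have a closed form. This follows from the standard identity for the integral of a product of two normalized Gaussian densities; it is offered here as a convenience and is not part of the original derivation:

$$A(i,j) = \int_{\vec{x} \in \Re^d} G_i(\vec{x})\, G_j(\vec{x})\, d\vec{x} = \frac{1}{(2\pi)^{d/2} |S_i + S_j|^{1/2}} \exp\Big(-\frac{1}{2} (\vec{c}_i - \vec{c}_j)^T (S_i + S_j)^{-1} (\vec{c}_i - \vec{c}_j)\Big).$$

In particular, $A(i,i) = (4\pi)^{-d/2} |S_i|^{-1/2}$, and the off-diagonal entries decay as the kernel centers move apart. If the region of interest is a bounded subset of $\Re^d$, the identity no longer holds exactly.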



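Substituting Equations 44 and 45 into the EISD expression, we get: 

$$\begin{aligned}
E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] &= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n] \\
&= \int_{a \in \Re^{K}} \mathcal{P}(a \mid \mathcal{D}_n)\, \delta(\hat{a}, a)\, da \\
&= \int_{a \in \Re^{K}} \frac{1}{(2\pi)^{K/2} |\Sigma_n|^{1/2}} \exp\Big(-\frac{1}{2} (a - \hat{a})^T \Sigma_n^{-1} (a - \hat{a})\Big)\, (a - \hat{a})^T A\, (a - \hat{a})\, da \qquad (46)
\end{aligned}$$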



Making the following change of variables as in the polynomial case: $q = A^{1/2}(a - \hat{a})$, and noting 
that the integration bounds on $q$ are still $\Re^{K}$, Equation 46 becomes: 

$$\begin{aligned}
E_{\mathcal{F}}[\delta(\hat{g}, g) \mid \mathcal{D}_n] &= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n] \\
&= \int_{q \in \Re^{K}} \frac{1}{(2\pi)^{K/2} |A|^{1/4} |\Sigma_n|^{1/2} |A|^{1/4}} \exp\Big(-\frac{1}{2} q^T A^{-1/2} \Sigma_n^{-1} A^{-1/2} q\Big)\, q^T q\, dq \\
&= \int_{q \in \Re^{K}} \frac{1}{(2\pi)^{K/2} |\Sigma_n A|^{1/2}} \exp\Big(-\frac{1}{2} q^T A^{-1/2} \Sigma_n^{-1} A^{-1/2} q\Big)\, q^T q\, dq \\
&= |\Sigma_n A| \propto |\Sigma_n| \qquad (47)
\end{aligned}$$

since $A$ is just a constant matrix of numbers. 

Notice from Equation 42 that $\Sigma_n$ depends only on the RBF function class priors $\Sigma_{\mathcal{F}}$, the $K$ 
fixed Gaussian RBF kernels $\{G_i(\cdot) \mid i = 1, \ldots, K\}$, the output noise variance $\sigma_s^2$ and the previously 
sampled input locations $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$. Like the polynomial case, it does not depend on the 
previous $y$ data values actually observed. In other words, the previously observed $y$ data values in 
$\mathcal{D}_n$ do not affect the EISD measure (Equation 47) for this Gaussian RBF concept class! 

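A.2.3 The RBF Total Output Uncertainty Measure 

The total output uncertainty cost function is simply the expected EISD between $g$ and its new 
estimate $\hat{g}_{n+1}$, if the learner samples next at $\vec{x}_{n+1}$. The cost function is given by Equation 7. We 
rewrite the expression below in terms of our RBF model parameters: 

$$\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \vec{x}_{n+1}) = \int \mathcal{P}(y_{n+1} \mid \vec{x}_{n+1}, \mathcal{D}_n)\, E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (\vec{x}_{n+1}, y_{n+1})]\, dy_{n+1}. \qquad (48)$$

where: 

$$\mathcal{P}(y_{n+1} \mid \vec{x}_{n+1}, \mathcal{D}_n) \propto \int_{a \in \Re^{K}} \mathcal{P}(\mathcal{D}_n \cup (\vec{x}_{n+1}, y_{n+1}) \mid a)\, \mathcal{P}_{\mathcal{F}}(a)\, da.$$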



It is clear from Equation 48 that $\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \vec{x}_{n+1})$ is merely the new EISD weighted and 
averaged over all possible values of $y_{n+1}$ at $\vec{x}_{n+1}$. Recall from Equation 42 however, that for 
this RBF concept class, the EISD between $g$ and its estimate $\hat{g}$ depends only on the input $\vec{x}_i$ 
values in $\mathcal{D}_n$ and not on the observed $y_i$ values. This means that $E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (\vec{x}_{n+1}, y_{n+1})]$, 
the new EISD resulting from sampling next at $\vec{x}_{n+1}$, does not depend on $y_{n+1}$! Equation 48 can 
therefore be further simplified, which leads to the following closed form expression for the total 
output uncertainty cost function, given also in Equation 17: 

$$\begin{aligned}
\mathcal{U}(\hat{g}_{n+1} \mid \mathcal{D}_n, \vec{x}_{n+1}) &= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (\vec{x}_{n+1}, y_{n+1})] \int \mathcal{P}(y_{n+1} \mid \vec{x}_{n+1}, \mathcal{D}_n)\, dy_{n+1} \\
&= E_{\mathcal{F}}[\delta(\hat{a}, a) \mid \mathcal{D}_n \cup (\vec{x}_{n+1}, y_{n+1})] \\
&= |\Sigma_{n+1} A| \propto |\Sigma_{n+1}| \qquad (49)
\end{aligned}$$



Here, $\Sigma_{n+1}$ has exactly the same form as $\Sigma_n$ in Equation 42, and depends only on the RBF 
function class priors $\Sigma_{\mathcal{F}}$, the $K$ fixed Gaussian RBF kernels $\{G_i(\cdot) \mid i = 1, \ldots, K\}$, the output noise 
variance $\sigma_s^2$ and the data input locations $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n, \vec{x}_{n+1}\}$. 
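
As in the polynomial case, Equation 49 can be evaluated for each candidate location before any output is observed. The following is a minimal sketch that builds the kernel vector $\bar{z}$, accumulates $\Sigma_n^{-1}$ from Equation 42 and picks the candidate with the smallest $|\Sigma_{n+1}|$; the function names, the finite candidate list and the two-kernel example are illustrative assumptions.

```python
import numpy as np

def kernel_vector(x, centers, covs):
    """z = [G_1(x), ..., G_K(x)] for Gaussian kernels with fixed centers and covariances."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    z = np.empty(len(centers))
    for i, (c, S) in enumerate(zip(centers, covs)):
        diff = x - c
        norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(S))
        z[i] = np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm
    return z

def rbf_next_sample(x_seen, candidates, centers, covs, prior_cov, noise_var):
    """Return the candidate that minimizes |Sigma_{n+1}| (Equation 49)."""
    precision = np.linalg.inv(prior_cov)
    for x in x_seen:
        z = kernel_vector(x, centers, covs)
        precision = precision + np.outer(z, z) / noise_var     # Equation 42

    def logdet_cov(x):
        z = kernel_vector(x, centers, covs)
        _, logdet = np.linalg.slogdet(precision + np.outer(z, z) / noise_var)
        return -logdet                                          # log |Sigma_{n+1}|

    return min(candidates, key=logdet_cov)

# Two one-dimensional kernels; one sample has already been taken at the first center.
centers = [np.array([-1.0]), np.array([1.0])]
covs = [np.array([[0.25]]), np.array([[0.25]])]
candidates = [np.array([-1.0]), np.array([0.0]), np.array([1.0])]
chosen = rbf_next_sample([np.array([-1.0])], candidates, centers, covs, np.eye(2), 0.1)
```

Here the weight of the first kernel is already well constrained by the existing sample, so the sketch selects the location of the second kernel's center.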